Abstract

Prediction is a complex notion, and different predictors (such as peo- 
ple, computer programs, and probabilistic theories) can pursue very dif- 
ferent goals. In this paper I will review some popular kinds of prediction 
and argue that the theory of competitive on-line learning can benefit from 
the kinds of prediction that are now foreign to it. 

The standard goal for a predictor in learning theory is to incur a small
loss for a given loss function measuring the discrepancy between the pre- 
dictions and the actual outcomes. Competitive on-line learning concen- 
trates on a "relative" version of this goal: the predictor is to perform 
almost as well as the best strategies in a given benchmark class of pre- 
diction strategies. Such predictions can be interpreted as decisions made 
by a "small" decision maker (i.e., one whose decisions do not affect the 
future outcomes). 

Predictions, or probability forecasts, considered in the foundations of 
probability are statements rather than decisions; the loss function is re- 
placed by a procedure for testing the forecasts. The two main approaches 
to the foundations of probability are measure-theoretic (as formulated by 
Kolmogorov) and game-theoretic (as developed by von Mises and Ville); 
the former is now dominant in mathematical probability theory, but the 
latter appears to be better adapted for uses in learning theory discussed 
in this paper. 

An important achievement of Kolmogorov's school of the foundations
of probability was the construction of a universal testing procedure and the
realization (Levin, 1976) that there exists a forecasting strategy that
produces ideal forecasts. Levin's ideal forecasting strategy, however, is not
computable. Its more practical versions can be obtained from the results 
of game-theoretic probability theory. For a wide class of forecasting pro- 
tocols, it can be shown that for any computable game-theoretic law of 
probability there exists a computable forecasting strategy that produces 
ideal forecasts, as far as this law of probability is concerned. Choosing 
suitable laws of probability we can ensure that the forecasts agree with 
reality in requisite ways. 



Probability forecasts that are known to agree with reality can be used 
for making good decisions: the most straightforward procedure is to select 
decisions that are optimal under the forecasts (the principle of minimum 
expected loss). This gives, inter alia, a powerful tool for competitive on- 
line learning; I will describe its use for designing prediction algorithms 
that satisfy the property of universal consistency and its more practical 
versions. 

In conclusion of the paper I will discuss some limitations of competitive 
on-line learning and possible directions of further research. 
1 Introduction

This paper is based on my invited talk at the 19th Annual Conference on Learn- 
ing Theory (Pittsburgh, PA, June 24, 2006). In recent years COLT invited talks 
have tended to aim at establishing connections between the traditional concerns 
of the learning community and the work done by other communities (such as 
game theory, statistics, information theory, and optimization). Following this 
tradition, I will argue that some ideas from the foundations of probability can 
be fruitfully applied in competitive on-line learning. 

In this paper I will use the following informal taxonomy of predictions (rem- 
iniscent of Shafer's (32!, Figure 2, taxonomy of probabilities): 

D-predictions are mere Decisions. They can never be true or false but can be 
good or bad. Their quality is typically evaluated with a loss function. 

S-predictions are Statements about reality. They can be tested and, if found 
inadequate, rejected as false. 

F-predictions (or Frequentist predictions) are intermediate between D-predic- 
tions and S-predictions. They are successful if they match the frequencies 
of various observed events. 
Traditionally, learning theory in general and competitive on-line learning in
particular consider D-predictions. I will start, in Section 2, from a simple
asymptotic result about D-predictions: there exists a universally consistent
on-line prediction algorithm (randomized if the loss function is not required to
be convex in the prediction). Section 3 is devoted to S-prediction and Section 4
to F-prediction. We will see that S-prediction is more fundamental than, and can
serve as a tool for, F-prediction. A later section explains how F-prediction (and
so, indirectly, S-prediction) is relevant for D-prediction. In Section 7 I will
prove the result of Section 2 about universal consistency, as well as its
non-asymptotic version.

2 Universal consistency 

In all prediction protocols in this paper every player can see the other players' 
moves made so far (they are perfect-information protocols). The most basic one 
is: 

Prediction protocol 

FOR $n = 1, 2, \ldots$:
  Reality announces $x_n \in X$.
  Predictor announces $\gamma_n \in \Gamma$.
  Reality announces $y_n \in Y$.
END FOR.

At the beginning of each round $n$ Predictor is given some data $x_n$ relevant to
predicting the following observation $y_n$; $x_n$ may contain information about $n$
and the previous observations $y_{n-1}, y_{n-2}, \ldots$. The data is taken from the
data space $X$ and the observations from the observation space $Y$. The predictions
$\gamma_n$ are taken from the prediction space $\Gamma$, and a prediction's quality in
light of the actual observation is measured by a loss function
$\lambda : X \times \Gamma \times Y \to \mathbb{R}$. This
is how we formalize D-predictions. The prediction protocol will sometimes be
referred to as the "prediction game" (in general, "protocol" and "game" will be
used as synonyms, with a tendency to use "protocol" when the players' goals
are not clearly stated; for example, a prediction game is a prediction protocol
complemented by a loss function).

We will always assume that the data space $X$, the prediction space $\Gamma$, and the
observation space $Y$ are non-empty topological spaces and that the loss function
$\lambda$ is continuous. Moreover, we are mainly interested in the case where $X$,
$\Gamma$, and $Y$ are locally compact metric spaces, the prime examples being Euclidean
spaces and their open and closed subsets. Traditionally only loss functions
$\lambda(x, \gamma, y) = \lambda(\gamma, y)$ that do not depend on $x$ are considered in
learning theory, and this case appears to be most useful and interesting. The reader
might prefer to concentrate on this case.

Predictor's total loss over the first $N$ rounds is
$\sum_{n=1}^N \lambda(x_n, \gamma_n, y_n)$. As usual
in competitive on-line prediction (see [9] for a recent book-length review of the
field), Predictor competes with a wide range of prediction rules $D : X \to \Gamma$. The
total loss of such a prediction rule is $\sum_{n=1}^N \lambda(x_n, D(x_n), y_n)$, and so
Predictor's goal is to achieve

$$\sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) \lesssim \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) \qquad (1)$$

for all $N = 1, 2, \ldots$ and as many prediction rules $D$ as possible.
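The comparison in (1) can be made concrete with a minimal Python sketch that computes Predictor's total loss, the total loss of a prediction rule $D$, and their difference. The function names, the absolute loss, and the toy data are illustrative assumptions and do not come from the paper.

```python
def total_loss(loss, xs, gammas, ys):
    """Sum of loss(x_n, gamma_n, y_n) over the rounds played so far."""
    return sum(loss(x, g, y) for x, g, y in zip(xs, gammas, ys))


def excess_loss(loss, rule, xs, gammas, ys):
    """Predictor's total loss minus the total loss of the prediction rule D."""
    rule_predictions = [rule(x) for x in xs]
    return total_loss(loss, xs, gammas, ys) - total_loss(loss, xs, rule_predictions, ys)


# Hypothetical toy data: absolute loss, and a rule D(x) = x that happens to be perfect.
absolute_loss = lambda x, gamma, y: abs(y - gamma)
xs = [0.0, 1.0, 0.0, 1.0]
ys = [0.0, 1.0, 0.0, 1.0]
gammas = [0.5, 0.0, 1.0, 0.0]  # Predictor starts at 0.5, then repeats the last observation
excess = excess_loss(absolute_loss, lambda x: x, xs, gammas, ys)
print(excess)
```

Goal (1) asks that this difference stay small relative to $N$ for as many rules $D$ as possible; here the naive Predictor fails against the rule $D(x) = x$.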

Predictor's strategies in the prediction protocol will be called on-line predic- 
tion algorithms (or strategies). 

Remark 1 Some common prediction games are not about prediction at all, as
this word is usually understood. For example, in Cover's game of sequential
investment ([H], Chapter 10) with $K$ stocks,

$$Y := [0, \infty)^K, \qquad \Gamma := \{(g_1, \ldots, g_K) \in [0, \infty)^K \mid g_1 + \cdots + g_K = 1\},$$

$$\lambda((g_1, \ldots, g_K), (y_1, \ldots, y_K)) := -\ln \sum_{k=1}^K g_k y_k$$

(there is no $X$; or, more formally, $X$ consists of one element which is omitted
from our notation). The observation $y$ is interpreted as the ratios of the closing
to opening price of the $K$ stocks and the "prediction" $\gamma$ is the proportions of
the investor's capital invested in different stocks at the beginning of the round.
The loss function is the minus logarithmic increase in the investor's capital.
In this example $\gamma$ can hardly be called a prediction: in fact it is a decision
made by a small decision maker, i.e., a decision maker whose actions do not affect
Reality's future behavior (this aspect of competitive on-line prediction is
discussed further later in the paper). For other games of this kind, see [52].
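As an illustration of the loss function in Remark 1, the following sketch evaluates the minus logarithmic growth of capital for one round and checks that the losses of two rounds add up to minus the logarithm of the overall growth. The function name and the numbers are hypothetical.

```python
import math


def investment_loss(portfolio, price_ratios):
    """Minus log of the one-round capital growth: -ln(sum_k g_k * y_k)."""
    if len(portfolio) != len(price_ratios):
        raise ValueError("portfolio and price ratios must have the same length")
    if any(g < 0 for g in portfolio) or not math.isclose(sum(portfolio), 1.0):
        raise ValueError("portfolio must be non-negative and sum to 1")
    growth = sum(g * y for g, y in zip(portfolio, price_ratios))
    # Zero growth means the capital is wiped out: the loss is infinite.
    return math.inf if growth == 0 else -math.log(growth)


# Two hypothetical rounds with K = 2 stocks and an equal-weight portfolio.
rounds = [([0.5, 0.5], [2.0, 1.0]), ([0.5, 0.5], [0.5, 1.0])]
cumulative_loss = sum(investment_loss(g, y) for g, y in rounds)
final_capital = math.exp(-cumulative_loss)  # capital relative to the initial capital
print(final_capital)
```

The cumulative loss is the negative log of the final capital, which is why minimizing total loss in this game amounts to maximizing wealth.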



Universal consistency for deterministic prediction algorithms

Let us say that a set in a topological space is precompact if its closure is compact.
In Euclidean spaces, precompactness means boundedness. An on-line prediction
algorithm is universally consistent for a loss function $\lambda$ if its predictions
$\gamma_n$ always satisfy

$$(\{x_1, x_2, \ldots\} \text{ and } \{y_1, y_2, \ldots\} \text{ are precompact}) \Longrightarrow \limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) \right) \le 0 \qquad (2)$$

for any continuous prediction rule $D : X \to \Gamma$. The intuition behind the
antecedent of (2), in the Euclidean case, is that the prediction algorithm knows
that $\|x_n\|$ and $\|y_n\|$ are bounded but does not know an upper bound in advance.
Of course, universal consistency is only a minimal requirement for successful
prediction; we will also be interested in bounds on the predictive performance of
our algorithms.

Let us say that the loss function $\lambda$ is compact-type if for each pair of compact
sets $A \subseteq X$ and $B \subseteq Y$ and each constant $M$ there exists a compact set
$C \subseteq \Gamma$ such that

$$\forall x \in A, \gamma \notin C, y \in B : \; \lambda(x, \gamma, y) > M.$$

More intuitively, we require that $\lambda(x, \gamma, y) \to \infty$ as $\gamma \to \infty$
uniformly in $(x, y)$ ranging over a compact set.

Theorem 1 Suppose $X$ and $Y$ are locally compact metric spaces, $\Gamma$ is a convex
subset of a Frechet space, and the loss function $\lambda(x, \gamma, y)$ is continuous,
compact-type, and convex in the variable $\gamma \in \Gamma$. There exists a universally
consistent on-line prediction algorithm.

To have a specific example in mind, the reader might check that $X = \mathbb{R}^K$,
$\Gamma = Y = \mathbb{R}^L$, and $\lambda(x, \gamma, y) := \|y - \gamma\|$ satisfy the
conditions of the theorem.

Universal consistency for randomized prediction algorithms 

When the loss function $\lambda(x, \gamma, y)$ is not convex in $\gamma$, two difficulties appear:

• the conclusion of Theorem 1 becomes false if the convexity requirement is
removed (P9, Theorem 2);

• in some cases the notion of a continuous prediction rule becomes vacuous:
e.g., there are no non-constant continuous prediction rules when $\Gamma = \{0, 1\}$
and $X$ is connected.

To overcome these difficulties, we consider randomized prediction rules and
randomized on-line prediction algorithms (with independent randomizations). It
will follow from the proofs given later in the paper that one can still guarantee
that (2) holds, although only with probability one; on the other hand, there will
be a vast supply of continuous prediction rules.

Remark 2 In fact, the second difficulty is more apparent than real: for example,
in the binary case ($Y = \{0, 1\}$) with the loss function $\lambda(\gamma, y)$ independent
of $x$, there are many non-trivial continuous prediction rules in the canonical form
of the prediction game [45] with the prediction set redefined as the boundary of
the set of superpredictions [Blj.

A randomized prediction rule is a function $D : X \to \mathcal{P}(\Gamma)$ mapping the data
space into the probability measures on the prediction space; $\mathcal{P}(\Gamma)$ is always
equipped with the topology of weak convergence. A randomized on-line prediction
algorithm is an on-line prediction algorithm in the extended prediction
game with the prediction space $\mathcal{P}(\Gamma)$. Let us say that a randomized on-line
prediction algorithm is universally consistent if, for any continuous randomized
prediction rule $D : X \to \mathcal{P}(\Gamma)$,

$$(\{x_1, x_2, \ldots\} \text{ and } \{y_1, y_2, \ldots\} \text{ are precompact}) \Longrightarrow \limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(x_n, g_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(x_n, d_n, y_n) \right) \le 0 \text{ a.s.,} \qquad (3)$$

where $g_1, g_2, \ldots, d_1, d_2, \ldots$ are independent random variables with $g_n$
distributed as $\gamma_n$ and $d_n$ distributed as $D(x_n)$, $n = 1, 2, \ldots$. Intuitively,
the "a.s." in (3) refers to the algorithm's and prediction rule's internal
randomization.

Theorem 2 Let $X$ and $Y$ be locally compact metric spaces, $\Gamma$ be a metric space,
and $\lambda$ be a continuous and compact-type loss function. There exists a universally
consistent randomized on-line prediction algorithm.

Let $X$ be a metric space. For any discrete (e.g., finite) subset $\{x_1, x_2, \ldots\}$
of $X$ and any sequence $\gamma_n \in \mathcal{P}(\Gamma)$ of probability measures on $\Gamma$ there
exists a continuous randomized prediction rule $D$ such that $D(x_n) = \gamma_n$ for all $n$
(indeed, it suffices to set $D(x) := \sum_n \phi_n(x) \gamma_n$, where
$\phi_n : X \to [0, 1]$, $n = 1, 2, \ldots$, are
continuous functions with disjoint supports such that $\phi_n(x_n) = 1$ for all $n$).
Therefore, there is no shortage of randomized prediction rules.

Continuity, compactness, and the statistical notion of universal consistency

In the statistical setting, where $(x_n, y_n)$ are assumed to be generated
independently from the same probability measure, the definition of universal
consistency was given by Stone 'iT in 1977. One difference of Stone's definition
from ours is the lack of the requirement that $D$ should be continuous in his
definition.

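If the requirement of continuity of $D$ is dropped from our definition, universal
consistency becomes impossible to achieve: Reality can easily choose $x_n \to c$,
where $c$ is a point of discontinuity of $D$, and $y_n$ in such a way that Predictor's
loss will inevitably be much larger than $D$'s. To be more specific, suppose
$X = \Gamma = Y = [-1, 1]$ and $\lambda(x, \gamma, y) = |y - \gamma|$ (more generally, the loss
is zero when $y = \gamma$ and positive when $y \ne \gamma$). No matter how Predictor chooses
his predictions $\gamma_n$, Reality can choose

$$x_n := \sum_{i=1}^{n-1} \operatorname{sign}(\gamma_i) \, 3^{-i}, \qquad y_n := -\operatorname{sign} \gamma_n,$$

where the function sign is defined as

$$\operatorname{sign} \gamma := \begin{cases} 1 & \text{if } \gamma \ge 0, \\ -1 & \text{otherwise,} \end{cases}$$

and thus foil (2) for the prediction rule

$$D(x) := \begin{cases} -1 & \text{if } x < \sum_{i=1}^{\infty} \operatorname{sign}(\gamma_i) \, 3^{-i}, \\ 1 & \text{otherwise.} \end{cases}$$

(Indeed, these definitions imply $D(x_n) = -\operatorname{sign} \gamma_n = y_n$ for all $n$.)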
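The argument that a discontinuous rule can beat any deterministic Predictor can be checked numerically. The sketch below assumes one concrete choice of Reality's moves, namely $x_n = \sum_{i<n} \operatorname{sign}(\gamma_i) 3^{-i}$ and $y_n = -\operatorname{sign} \gamma_n$, and a hypothetical running-mean Predictor; the function names are illustrative, and the finite horizon replaces the infinite sum defining the discontinuity point.

```python
def sign(gamma):
    """sign(gamma) = 1 if gamma >= 0 and -1 otherwise."""
    return 1 if gamma >= 0 else -1


def play_reality(predictor, rounds):
    """Play Reality's counter-strategy against a deterministic Predictor.

    predictor(x, past) returns gamma_n in [-1, 1] given the datum x_n and the
    list of past (x, y) pairs.
    """
    xs, gammas, ys = [], [], []
    x = 0.0  # x_1 is the empty sum
    for n in range(1, rounds + 1):
        gamma = predictor(x, list(zip(xs, ys)))
        y = -sign(gamma)  # y_n := -sign(gamma_n)
        xs.append(x)
        gammas.append(gamma)
        ys.append(y)
        x += sign(gamma) * 3.0 ** (-n)  # x_{n+1} := x_n + sign(gamma_n) * 3^{-n}
    return xs, gammas, ys, x  # the last partial sum stands in for the point c


def running_mean_predictor(x, past):
    """Hypothetical Predictor: average of past observations (0 at the start)."""
    return sum(y for _, y in past) / len(past) if past else 0.0


xs, gammas, ys, c = play_reality(running_mean_predictor, rounds=15)
rule = lambda x: -1 if x < c else 1  # discontinuous prediction rule D
predictor_losses = [abs(y - g) for g, y in zip(gammas, ys)]
rule_losses = [abs(y - rule(x)) for x, y in zip(xs, ys)]
print(sum(predictor_losses), sum(rule_losses))
```

Under these assumptions the rule's loss is zero in every round while Predictor loses at least 1 per round, so the average difference in (2) stays bounded away from zero.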

A positive argument in favor of the requirement of continuity of $D$ is that it is
natural for Predictor to compete only with computable prediction rules, and
continuity is often regarded as a necessary condition for computability (Brouwer's
"continuity principle").

Another difference of Stone's definition is that compactness does not play any
special role in it (cf. the antecedent of (2)). It is easy to see that the condition
that $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$ are precompact is essential in our
framework. Indeed, let us suppose, e.g., that $\{x_1, x_2, \ldots\}$ is allowed not to be
precompact, continuing to assume that $X$ is a metric space and also assuming that
$Y$ is a convex subset of a topological vector space. Reality can then choose $x_n$,
$n = 1, 2, \ldots$, as a discrete set in $X$ (^S], 4.1.17). Let $\phi_n : X \to [0, 1]$,
$n = 1, 2, \ldots$, be continuous functions with disjoint supports such that
$\phi_n(x_n) = 1$ for all $n$. For any sequence of observations $y_1, y_2, \ldots$, the
function $D(x) := \sum_n \phi_n(x) y_n$ is a continuous prediction rule such that
$D(x_n) = y_n$ for all $n$. Under such
circumstances it is impossible to compete with all continuous prediction rules
unless the loss function satisfies some very special properties.

As compared to competitive on-line prediction, the statistical setting is 
rather restrictive. Compactness and continuity may be said to be satisfied au- 
tomatically: under mild conditions, every measurable prediction rule can be 
arbitrarily well approximated by a continuous one (according to Luzin's the- 
orem, 7.5.2, combined with the Tietze-Uryson theorem, ^3], 2.1.8), and 
every probability measure is almost concentrated on a compact set (according 
to Ulam's theorem, [H], 7.1.4). 

3 Defensive forecasting 

In this and the next section we will discuss S-prediction and F-prediction, which
will prepare the way for proving Theorems 1 and 2.

Remark 3 In this paper, S-predictions and F-predictions will always be probability
measures, whereas typical D-predictions are not measures. This difference
is, however, accidental: e.g., in the problem of on-line regression (as in
Section 5) different kinds of predictions are objects of the same nature.

Testing predictions in measure-theoretic probability and 
neutral measures 

S-predictions are empirical statements about the future; they may turn out true
or false as time passes. For such statements to be non-vacuous, we need to
have a clear idea of when they become falsified by future observations. In
principle, the issuer of S-predictions should agree in advance to a protocol of
testing his predictions. It can be said that such a protocol provides an empirical
meaning to the predictions.

Testing is, of course, a well-developed area of statistics (see, e.g., ^01) Chapter 3).
A typical problem is: given a probability measure (the "null hypothesis")
$P$ on a set $\Omega$, which observations $\omega \in \Omega$ falsify $P$? In the context of
this paper, $P$ is an S-prediction, or, as we will often say, a probability forecast
for $\omega \in \Omega$. Developing Kolmogorov's ideas (see, e.g., Section 4, [23], and
Martin-Löf (1966, [221) defines a (in some sense, "the") universal statistical test
for a computable $P$. Levin (1976, [2S|) modifies Martin-Löf's definition of statistical
test (which was, in essence, the standard statistical definition) and extends it to
noncomputable $P$; Levin's 1976 definition is "uniform", in an important sense.

Levin's test is a function $t : \Omega \times \mathcal{P}(\Omega) \to [0, \infty]$, where
$\mathcal{P}(\Omega)$ is the set of all Borel probability measures on $\Omega$, assumed to be a
topological space. Levin pS) considers the case $\Omega = \{0, 1\}^\infty$ but notes that
his argument works for any other "good" compact space with a countable base. We will
assume that $\Omega$ is a metric compact (which is equivalent to Levin's assumption that
$\Omega$ is a compact space with a countable base, j^, 4.2.8), endowing
$\mathcal{P}(\Omega)$ with the topology of weak convergence (see below for references). Let
us say that a function $t : \Omega \times \mathcal{P}(\Omega) \to [0, \infty]$ is a test of
randomness if it is lower semicontinuous and, for all $P \in \mathcal{P}(\Omega)$,

$$\int_\Omega t(\omega, P) \, P(d\omega) \le 1.$$

The intuition behind this definition is that if we first choose a test $t$, then
observe $\omega$, and then find that $t(\omega, P)$ is very large for the observed $\omega$,
we are entitled to reject the hypothesis that $\omega$ was generated from $P$ (notice
that, by Markov's inequality, the $P$-probability that $t(\omega, P) \ge C$ cannot exceed
$1/C$, for any $C > 0$).

The following fundamental result is due to Levin ([IHIj footnote *^^-'), although
our proof is slightly different (for details of Levin's proof, see E], Section 5).

Lemma 1 (Levin) Let $\Omega$ be a metric compact. For any test of randomness $t$
there exists a probability measure $P$ such that

$$\forall \omega \in \Omega : \; t(\omega, P) \le 1. \qquad (4)$$

Before proving this result, let us recall some useful facts about the probability
measures on the metric compact $\Omega$. The Banach space of all continuous functions
on $\Omega$ with the usual pointwise addition and scalar action and the sup norm
will be denoted $C(\Omega)$. By one of the Riesz representation theorems (Ql], 7.4.1;
see also 7.1.1), the mapping $\mu \mapsto I_\mu$, where $I_\mu(f) := \int_\Omega f \, d\mu$, is a
linear isometry between the set of all finite Borel measures $\mu$ on $\Omega$ with the
total variation norm and the dual space $C'(\Omega)$ to $C(\Omega)$ with the standard dual
norm ([311, Chapter 4). We will identify the finite Borel measures $\mu$ on $\Omega$ with
the corresponding $I_\mu \in C'(\Omega)$. This makes $\mathcal{P}(\Omega)$ a convex closed subset
of $C'(\Omega)$.

We will be interested, however, in a different topology on $C'(\Omega)$, the weakest
topology for which all evaluation functionals $\mu \in C'(\Omega) \mapsto \mu(f)$,
$f \in C(\Omega)$, are continuous. This topology is known as the weak* topology
(PH, 3.14), and the topology inherited by $\mathcal{P}(\Omega)$ is known as the topology of
weak convergence (El) Appendix III). The point mass $\delta_\omega$, $\omega \in \Omega$, is
defined to be the probability measure concentrated at $\omega$,
$\delta_\omega(\{\omega\}) = 1$. The simple example of a sequence
of point masses $\delta_{\omega_n}$ such that $\omega_n \to \omega$ as $n \to \infty$ and
$\omega_n \ne \omega$ for all $n$ shows
that the topology of weak convergence is different from the dual norm topology:
$\delta_{\omega_n} \to \delta_\omega$ holds in one but does not hold in the other.

It is not difficult to check that $\mathcal{P}(\Omega)$ remains a closed subset of
$C'(\Omega)$ in the weak* topology (T!, III. 2. 7, Proposition 7). By the
Banach-Alaoglu theorem (Cpi). 3.15) $\mathcal{P}(\Omega)$ is compact in the topology of weak
convergence (this is a special case of Prokhorov's theorem, Appendix III,
Theorem 6). In the rest of this paper, $\mathcal{P}(\Omega)$ (and all other spaces of
probability measures) are always equipped with the topology of weak convergence.

Since $\Omega$ is a metric compact, $\mathcal{P}(\Omega)$ is also metrizable (by the well-known
Prokhorov metric: jB], Appendix III, Theorem 6).

Proof of Lemma 1 If $t$ takes value $\infty$, redefine it as $t := \min(t, 2)$. For all
$P, Q \in \mathcal{P}(\Omega)$ set

$$\phi(Q, P) := \int_\Omega t(\omega, P) \, Q(d\omega).$$

The function $\phi(Q, P)$ is linear in its first argument, $Q$, and lower semicontinuous
(see Lemma 2 below) in its second argument, $P$. Ky Fan's minimax theorem
(see, e.g., [2], Theorem 11.4; remember that $\mathcal{P}(\Omega)$ is a compact convex subset
of $C'(\Omega)$ equipped with the weak* topology) shows that there exists
$P^* \in \mathcal{P}(\Omega)$ such that

$$\forall Q \in \mathcal{P}(\Omega) : \; \phi(Q, P^*) \le \sup_{P \in \mathcal{P}(\Omega)} \phi(P, P).$$

Therefore,

$$\forall Q \in \mathcal{P}(\Omega) : \; \int_\Omega t(\omega, P^*) \, Q(d\omega) \le 1,$$

and we can see (taking $Q := \delta_\omega$) that $t(\omega, P^*)$ never exceeds 1. ∎

This proof used the following topological lemma.

Lemma 2 Suppose $F : X \times Y \to \mathbb{R}$ is a non-negative lower semicontinuous
function defined on the product of two metric compacts, $X$ and $Y$. If $Q$ is a
probability measure on $Y$, the function $x \in X \mapsto \int_Y F(x, y) \, Q(dy)$ is also
lower semicontinuous.

Proof The product $X \times Y$ is also a metric compact (JSl; 3.2.4 and 4.2.2).
According to Hahn's theorem (^S]) Problem 1.7.15(c)), there exists a non-decreasing
sequence of (non-negative) continuous functions $F_n(x, y)$ such that
$F_n(x, y) \to F(x, y)$ as $n \to \infty$ for all $(x, y) \in X \times Y$. Since each $F_n$
is uniformly continuous ([15], 4.3.32), the functions $\int_Y F_n(x, y) \, Q(dy)$ are
continuous, and by the monotone convergence theorem (JJ, 4.3.2) they converge to
$\int_Y F(x, y) \, Q(dy)$. Therefore, again by Hahn's theorem,
$\int_Y F(x, y) \, Q(dy)$ is lower semicontinuous. ∎

Lemma 1 says that for any test of randomness $t$ there is a probability forecast
$P$ such that $t$ never detects any disagreement between $P$ and the outcome $\omega$,
whatever $\omega$ might be.

Gács (^7], Section 3) defines a uniform test of randomness as a test of
randomness that is lower semicomputable (lower semicomputability is an
"effective" version of the requirement of lower semicontinuity; this requirement is
very natural in the context of randomness: cf. [ST], Section 3.1). He proves
([17], Theorem 1) that there exists a universal (i.e., largest to within a constant
factor) uniform test of randomness. If $t(\omega, P) < \infty$ for a fixed universal test
$t$, $\omega$ is said to be random with respect to $P$. Applied to the universal test,
Lemma 1 says that there exists a "neutral" probability measure $P$, such that every
$\omega$ is random with respect to $P$.

Gács (23, Theorem 7) shows that under his definition there are no neutral
measures that are computable even in the weak sense of upper or lower
semicomputability, even for the compactified set of natural numbers. Levin's
original definition of a uniform test of randomness involved some extra
conditions, which somewhat mitigate (but do not solve completely) the problem of
non-computability.

Testing predictions in game-theoretic probability 

There is an obvious mismatch between the dynamic prediction protocol of Section 2
and the one-step probability forecasting setting of the previous subsection.
If we still want to fit the former into the latter, perhaps we will have to take the
infinite sequence of data and observations, $x_1, y_1, x_2, y_2, \ldots$, as $\omega$, and
so take $\Omega := (X \times Y)^\infty$. To find a probability measure satisfying a useful
property, such as (4) for an interesting $t$, might be computationally expensive.
Besides, this would force us to assume that the $x_n$s are also generated from $P$,
and it would be preferable to keep them free of any probabilities (we cannot assume
that $x_n$ are given constants since they, e.g., may depend on the previous
observations).

A more convenient framework is provided by the game-theoretic foundations
of probability. This framework was first thoroughly explored by von Mises |29[
ES] (see [37], Chapter 2, for von Mises's precursors), and a serious shortcoming
of von Mises's theory was corrected by Ville 01]. After Ville, game-theoretic
probability was dormant before being taken up by Kolmogorov 1^. The
independence of game-theoretic probability from the standard measure-theoretic
probability [21] was emphasized by Dawid (cf. his prequential principle in [111
[T3)): see [ST] for a review.

There is a special player in the game-theoretic protocols who is responsible
for testing the forecasts; following [37], this player will be called Skeptic. This
is the protocol that we will be using in this paper:

Testing protocol

FOR $n = 1, 2, \ldots$:
  Reality announces $x_n \in X$.
  Forecaster announces $P_n \in \mathcal{P}(Y)$.
  Skeptic announces $f_n : Y \to \mathbb{R}$ such that $\int_Y f_n \, dP_n \le 0$.
  Reality announces $y_n \in Y$.
  $\mathcal{K}_n := \mathcal{K}_{n-1} + f_n(y_n)$.
END FOR.

Skeptic's move can be interpreted as taking a long position in a security
that pays $f_n(y_n)$ after $y_n$ becomes known; according to Forecaster's beliefs
encapsulated in $P_n$, Skeptic does not have to pay anything for this. We write
$\int_Y f_n \, dP_n \le 0$ to mean that $\int_Y f_n \, dP_n$ exists and is non-positive.
Skeptic starts from some initial capital $\mathcal{K}_0$, which is not specified in the
protocol; the evolution of $\mathcal{K}_n$, however, is described.

A game-theoretic procedure of testing Forecaster's performance is a strategy
for Skeptic in the testing protocol. If Skeptic starts from $\mathcal{K}_0 := 1$, plays so
that he never risks bankruptcy (we say that he risks bankruptcy if his move $f_n$
makes it possible for Reality to choose $y_n$ making $\mathcal{K}_n$ negative), and ends up
with a very large value $\mathcal{K}_N$ of his capital, we are entitled to reject the
forecasts as false. Informally, the role of Skeptic is to detect disagreement between
the forecasts and the actual observations, and the current size of his capital tells
us how successful he is at achieving this goal.
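A small simulation may help to see how Skeptic's capital measures disagreement between forecasts and outcomes. The sketch below assumes the binary case $Y = \{0, 1\}$, forecasts given as probabilities $p_n$ of the outcome 1, and a hypothetical Skeptic who always plays $f_n(y) = s_n (y - p_n)$ with $s_n$ a fixed fraction of his current capital; none of these choices comes from the paper.

```python
def skeptic_capital_path(forecasts, outcomes, stake_fraction=0.5, initial_capital=1.0):
    """Capital K_0, K_1, ..., K_N of a Skeptic betting that outcomes exceed forecasts.

    Skeptic's move is f_n(y) = s_n * (y - p_n), whose expectation under the
    forecast p_n is zero, so the move is allowed in the testing protocol.
    """
    if not -1.0 <= stake_fraction <= 1.0:
        raise ValueError("|stake_fraction| must not exceed 1 to avoid bankruptcy")
    capital = initial_capital
    path = [capital]
    for p, y in zip(forecasts, outcomes):
        stake = stake_fraction * capital  # |s_n| <= K_{n-1} and |y - p_n| <= 1
        capital += stake * (y - p)        # K_n := K_{n-1} + f_n(y_n)
        path.append(capital)
    return path


# Forecasts of 1/2 against outcomes that are always 1: capital grows geometrically.
biased = skeptic_capital_path([0.5] * 10, [1] * 10)
# The same forecasts against alternating outcomes: this Skeptic gains nothing.
balanced = skeptic_capital_path([0.5] * 10, [1, 0] * 5)
print(biased[-1], balanced[-1])
```

Because the stake never exceeds the current capital, this Skeptic never risks bankruptcy, and a large final capital is evidence against the forecasts.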

Defensive forecasting 

Levin's Lemma 1 can be applied to any testing procedure $t$ (test of randomness)
to produce forecasts that are ideal as far as that testing procedure is concerned.
Such ideal forecasts will be called "defensive forecasts"; in this subsection we
will be discussing a similar procedure of defensive forecasting in game-theoretic
probability.

Let us now slightly change the testing protocol: suppose that right after 
Reality's first move in each round Skeptic announces his strategy for the rest of 
that round. 

Defensive forecasting protocol 

FOR $n = 1, 2, \ldots$:
  Reality announces $x_n \in X$.
  Skeptic announces a lower semicontinuous $F_n : Y \times \mathcal{P}(Y) \to \mathbb{R}$
    such that $\int_Y F_n(y, P) \, P(dy) \le 0$ for all $P \in \mathcal{P}(Y)$.
  Forecaster announces $P_n \in \mathcal{P}(Y)$.
  Reality announces $y_n \in Y$.
  $\mathcal{K}_n := \mathcal{K}_{n-1} + F_n(y_n, P_n)$.
END FOR.

This protocol will be used in the situation where Skeptic has chosen in advance, 
and told Forecaster about, his testing strategy. However, the game-theoretic 
analogue of Levin's lemma holds even when Skeptic's strategy is disclosed in a 
piecemeal manner, as in our protocol. 

The following lemma can be proven in the same way as (and is a simple corollary
of) Levin's Lemma 1. Its version was first obtained by Akimichi Takemura
in 2004 1321.
Lemma 3 (Takemura) Let $Y$ be a metric compact. In the defensive forecasting
protocol, Forecaster can play in such a way that Skeptic's capital never
increases, no matter how he and Reality play.

Proof For all $P, Q \in \mathcal{P}(Y)$ set

$$\phi(Q, P) := \int_Y F_n(y, P) \, Q(dy),$$

where $F_n$ is Skeptic's move in round $n$. The function $\phi(Q, P)$ is linear in $Q$
and lower semicontinuous in $P$ (the latter also follows from Lemma 2 if we
notice that the assumption that $F$ is non-negative can be removed: every lower
semicontinuous function on a compact set is bounded below). Ky Fan's minimax
theorem shows that there exists $P^*$ such that

$$\phi(Q, P^*) \le \sup_{P \in \mathcal{P}(Y)} \phi(P, P) \le 0,$$

and we can see that $F_n(y, P^*)$ is always non-positive. Since the increment
$\mathcal{K}_n - \mathcal{K}_{n-1}$ equals $F_n(y_n, P_n)$, it suffices to set $P_n := P^*$. ∎
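In the binary case the forecast whose existence Lemma 3 guarantees can be found by elementary means. The sketch below assumes $Y = \{0, 1\}$, identifies a forecast with $p = P(\{1\})$, and assumes that Skeptic's move has the special form $F(y, p) = S(p)(y - p)$ with $S$ continuous on $[0, 1]$; this form, the function names, and the bisection tolerance are illustrative assumptions rather than the construction used in the proof.

```python
def defensive_forecast(stake, tol=1e-12):
    """Return p in [0, 1] making Skeptic's gain stake(p) * (y - p) non-positive
    (up to a small numerical error) for both outcomes y = 0 and y = 1."""
    if stake(0.0) <= 0:
        return 0.0  # at p = 0 the gains are stake(0) <= 0 (y = 1) and 0 (y = 0)
    if stake(1.0) >= 0:
        return 1.0  # at p = 1 the gains are 0 (y = 1) and -stake(1) <= 0 (y = 0)
    lo, hi = 0.0, 1.0  # invariant: stake(lo) > 0 >= stake(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if stake(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2  # approximate zero of the continuous function stake


# Hypothetical Skeptic who bets on 1 when p < 0.3 and on 0 when p > 0.3.
skeptic_stake = lambda p: 0.3 - p
p_star = defensive_forecast(skeptic_stake)
gains = [skeptic_stake(p_star) * (y - p_star) for y in (0, 1)]
print(p_star, gains)
```

The intermediate value theorem plays here the role that Ky Fan's minimax theorem plays in the general proof: either an endpoint works, or the stake changes sign and has a zero.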



Testing and laws of probability 

There are many interesting ways of testing probability forecasts. In fact, every
law of probability provides a way of testing probability forecasts (and vice versa,
any way of testing probability forecasts can be regarded as a law of probability).
As a simple example, consider the strong law of large numbers in the binary
case ($Y = \{0, 1\}$):

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N (y_n - p_n) = 0 \qquad (5)$$

with probability one, where $p_n := P_n(\{1\})$ is the predicted probability that
$y_n = 1$. If (5) is violated, we are justified in rejecting the forecasts $p_n$; in this
sense the strong law of large numbers can serve as a test.

In game-theoretic probability theory, the binary strong law of large numbers
is stated as follows: Skeptic has a strategy that, when started with
$\mathcal{K}_0 := 1$, never risks bankruptcy and makes Skeptic infinitely rich when (5) is
violated. We prove many such game-theoretic laws of probability in [37]; all of them
exhibit strategies (continuous or easily made continuous) for Skeptic that make
him rich when some property of agreement (such as, apart from various laws of
large numbers, the law of the iterated logarithm and the central limit theorem)
between the forecasts and the actual observations is violated. When Forecaster
plays the strategy of defensive forecasting against such a strategy for Skeptic,
the property of agreement is guaranteed to be satisfied, no matter how Reality
plays.

In the next section we will apply the procedure of defensive forecasting to
a law of large numbers found by Kolmogorov in 1929 (HOI; its simple
game-theoretic version can be found in [37], Lemma 6.1 and Proposition 6.1).
4 Calibration and resolution

In this section we will see how the idea of defensive forecasting can be used
for producing F-predictions. It is interesting that the pioneering work in this
direction by Foster and Vohra ^Hl was completely independent of Levin's idea.
The following is our basic probability forecasting protocol (more basic than the
protocols of the previous section).

Probability forecasting protocol

FOR $n = 1, 2, \ldots$:
  Reality announces $x_n \in X$.
  Forecaster announces $P_n \in \mathcal{P}(Y)$.
  Reality announces $y_n \in Y$.
END FOR.

Forecaster's prediction $P_n$ is a probability measure on $Y$ that, intuitively,
describes his beliefs about the likely values of $y_n$. Forecaster's strategy in this
protocol will be called a probability forecasting strategy (or algorithm).

Asymptotic theory of calibration and resolution 

The following is a simple asymptotic result about the possibility to ensure
"calibration" and "resolution".

Theorem 3 Suppose $X$ and $Y$ are locally compact metric spaces. There is a
probability forecasting strategy that guarantees

$$(\{x_1, x_2, \ldots\} \text{ and } \{y_1, y_2, \ldots\} \text{ are precompact}) \Longrightarrow \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \left( f(x_n, P_n, y_n) - \int_Y f(x_n, P_n, y) \, P_n(dy) \right) = 0 \qquad (6)$$

for all continuous functions $f : X \times \mathcal{P}(Y) \times Y \to \mathbb{R}$.

This theorem will be proven at the end of this section, and in the rest of
this subsection I will explain the intuition behind (6). The discussion here is
an extension of that in [47], Section 6. Let us assume, for simplicity, that $X$
and $Y$ are compact metric spaces; as before, $\delta_y$, where $y \in Y$, stands for the
probability measure in $\mathcal{P}(Y)$ concentrated on $\{y\}$.

We start from the intuitive notion of calibration (for further details, see [T5]
and ^ni). The probability forecasts $P_n$, $n = 1, \ldots, N$, are said to be "well
calibrated" (or "unbiased in the small", or "reliable", or "valid") if, for any
$P^* \in \mathcal{P}(Y)$,

$$\frac{\sum_{n=1,\ldots,N : P_n \approx P^*} \delta_{y_n}}{\sum_{n=1,\ldots,N : P_n \approx P^*} 1} \approx P^* \qquad (7)$$

provided $\sum_{n=1,\ldots,N : P_n \approx P^*} 1$ is not too small. The interpretation of (7)
is that the forecasts should be in agreement with the observed frequencies. We can
rewrite (7) as

$$\frac{\sum_{n=1,\ldots,N : P_n \approx P^*} (\delta_{y_n} - P_n)}{\sum_{n=1,\ldots,N : P_n \approx P^*} 1} \approx 0.$$

Assuming that $P_n \approx P^*$ for a significant fraction of the $n = 1, \ldots, N$, we can
further restate this as the requirement that

$$\frac{1}{N} \sum_{n=1,\ldots,N : P_n \approx P^*} \left( g(y_n) - \int_Y g(y) \, P_n(dy) \right) \approx 0 \qquad (8)$$

for a wide range of continuous functions $g$ (cf. the definition of the topology of
weak convergence in the previous section).

The fact that good calibration is only a necessary condition for good forecasting performance can be seen from the following standard example [13, 16]: if $\mathbf{Y}=\{0,1\}$ and

$$(y_1,y_2,y_3,y_4,\ldots)=(1,0,1,0,\ldots),$$

the forecasts $P_n(\{0\})=P_n(\{1\})=1/2$, $n=1,2,\ldots$, are well calibrated but rather poor; it would be better to forecast with

$$(P_1,P_2,P_3,P_4,\ldots)=(\delta_1,\delta_0,\delta_1,\delta_0,\ldots).$$

Assuming that each datum $x_n$ contains the information about the parity of $n$ (which can always be added to $x_n$), we can see that the problem with the former forecasting strategy is its lack of resolution: it does not distinguish between the data with odd and even $n$. In general, we would like each forecast $P_n$ to be as specific as possible to the current datum $x_n$; the resolution of a probability forecasting algorithm is the degree to which it achieves this goal (taking it for granted that $x_n$ contains all relevant information).
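
The alternating example can be checked numerically. The sketch below is an illustration only: it takes $g(y)=y$, identifies a forecast with the number $P_n(\{1\})$, and computes averages of the kind appearing in (8), first over all rounds and then over the rounds with odd $n$. The function name `average_discrepancy` is introduced here and is not from the paper.

```python
def average_discrepancy(forecasts, outcomes, select=lambda n: True):
    """(1/N) * sum over the selected rounds n of (y_n - P_n({1})).

    This is the average in (8) for g(y) = y and binary outcomes; `select`
    picks the rounds that enter the sum (all rounds by default).
    """
    N = len(outcomes)
    total = sum(y - p
                for n, (p, y) in enumerate(zip(forecasts, outcomes), start=1)
                if select(n))
    return total / N


N = 1000
outcomes = [n % 2 for n in range(1, N + 1)]      # 1, 0, 1, 0, ...
constant = [0.5] * N                              # P_n({1}) = 1/2 for all n
sharp = [float(y) for y in outcomes]              # delta_1, delta_0, ...

is_odd = lambda n: n % 2 == 1
all_rounds_constant = average_discrepancy(constant, outcomes)          # 0.0
odd_rounds_constant = average_discrepancy(constant, outcomes, is_odd)  # 0.25
odd_rounds_sharp = average_discrepancy(sharp, outcomes, is_odd)        # 0.0
```

The constant forecasts pass the unconditional check but fail badly once the rounds are split by parity, which is exactly the lack of resolution described above; the sharp forecasts pass both checks.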

Analogously to (8), the forecasts $P_n$, $n=1,\ldots,N$, may be said to have good resolution if, for any $x^*\in\mathbf{X}$,

$$\frac{1}{N}\sum_{n=1,\ldots,N:\,x_n\approx x^*}\left(g(y_n)-\int_{\mathbf{Y}}g(y)\,P_n(dy)\right)\approx 0 \tag{9}$$

for a wide range of continuous $g$. We can also require that the forecasts $P_n$, $n=1,\ldots,N$, should have good "calibration-cum-resolution": for any $(x^*,P^*)\in\mathbf{X}\times\mathcal{P}(\mathbf{Y})$,

$$\frac{1}{N}\sum_{n=1,\ldots,N:\,(x_n,P_n)\approx(x^*,P^*)}\left(g(y_n)-\int_{\mathbf{Y}}g(y)\,P_n(dy)\right)\approx 0 \tag{10}$$

for a wide range of continuous $g$. Notice that even if forecasts have both good calibration and good resolution, they can still have poor calibration-cum-resolution.

To make sense of the $\approx$ in, say, (8), we can replace each "crisp" point $P^*\in\mathcal{P}(\mathbf{Y})$ by a "fuzzy point" $I_{P^*}:\mathcal{P}(\mathbf{Y})\to[0,1]$; $I_{P^*}$ is required to be continuous, and we might also want to have $I_{P^*}(P^*)=1$ and $I_{P^*}(P)=0$ for all $P$ outside a small neighborhood of $P^*$. (The alternative of choosing $I_{P^*}:=I_A$, where $A$ is a small neighborhood of $P^*$ and $I_A$ is its indicator function, does not work because of Oakes's and Dawid's examples [32, 12]; $I_{P^*}$ can, however, be arbitrarily close to $I_A$.) This transforms (8) into

$$\frac{1}{N}\sum_{n=1}^{N}I_{P^*}(P_n)\left(g(y_n)-\int_{\mathbf{Y}}g(y)\,P_n(dy)\right)\approx 0,$$

which is equivalent to

$$\frac{1}{N}\sum_{n=1}^{N}\left(f(P_n,y_n)-\int_{\mathbf{Y}}f(P_n,y)\,P_n(dy)\right)\approx 0, \tag{11}$$

where $f(P,y):=I_{P^*}(P)g(y)$. It is natural to require that (11) should hold for a wide range of continuous functions $f(P,y)$, not necessarily of the form $I_{P^*}(P)g(y)$.

In the same way we can transform (9) into

$$\frac{1}{N}\sum_{n=1}^{N}\left(f(x_n,y_n)-\int_{\mathbf{Y}}f(x_n,y)\,P_n(dy)\right)\approx 0$$

and (10) into

$$\frac{1}{N}\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)\approx 0.$$

We can see that the consequent of (6) can be interpreted as the forecasts having good calibration-cum-resolution; the case where $f(x,P,y)$ depends only on $P$ and $y$ corresponds to good calibration, and the case where $f(x,P,y)$ depends only on $x$ and $y$ corresponds to good resolution.



Calibration-cum-resolution bounds 

A more explicit result about calibration and resolution is given in terms of "reproducing kernel Hilbert spaces". Let $\mathcal{F}$ be a Hilbert space of functions on a set $\Omega$ (with the pointwise operations of addition and scalar action). Its imbedding constant $c_{\mathcal{F}}$ is defined by

$$c_{\mathcal{F}}:=\sup_{\omega\in\Omega}\ \sup_{f\in\mathcal{F}:\,\|f\|_{\mathcal{F}}\le 1}f(\omega). \tag{12}$$

We will be interested in the case $c_{\mathcal{F}}<\infty$ and will refer to $\mathcal{F}$ satisfying this condition as reproducing kernel Hilbert spaces (RKHS) with finite imbedding constant.

The Hilbert space $\mathcal{F}$ is called a reproducing kernel Hilbert space (RKHS) if all evaluation functionals $f\in\mathcal{F}\mapsto f(\omega)$, $\omega\in\Omega$, are bounded; the class of RKHS with finite imbedding constant is a subclass of the class of RKHS. Let $\mathcal{F}$ be an RKHS on $\Omega$. By the Riesz-Fischer theorem, for each $\omega\in\Omega$ there exists a function $\mathbf{k}_\omega\in\mathcal{F}$ (the representer of $\omega$ in $\mathcal{F}$) such that

$$f(\omega)=\langle\mathbf{k}_\omega,f\rangle_{\mathcal{F}},\quad\forall f\in\mathcal{F}. \tag{13}$$

If $\Omega$ is a topological space and the mapping $\omega\mapsto\mathbf{k}_\omega$ is continuous, $\mathcal{F}$ is called a continuous RKHS. If $\Omega=\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$ and $\mathbf{k}_\omega=\mathbf{k}_{x,P,y}$ is a continuous function of $(P,y)\in\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$ for each $x\in\mathbf{X}$, we will say that $\mathcal{F}$ is forecast-continuous.

Theorem 4 Let $\mathbf{Y}$ be a metric compact and $\mathcal{F}$ be a forecast-continuous RKHS on $\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$ with finite imbedding constant $c_{\mathcal{F}}$. There is a probability forecasting strategy that guarantees

$$\left|\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)\right|\le 2c_{\mathcal{F}}\|f\|_{\mathcal{F}}\sqrt{N}$$

for all $N$ and all $f\in\mathcal{F}$.

Before proving Theorem 4 we will give an example of a convenient RKHS $\mathcal{F}$ that can be used in its applications. Let us consider a finite $\mathbf{Y}$, represent $\mathcal{P}(\mathbf{Y})$ as a simplex in a Euclidean space, and suppose that $\mathbf{X}$ is a bounded open subset of a Euclidean space. The interior $\operatorname{Int}\mathcal{P}(\mathbf{Y})$ of $\mathcal{P}(\mathbf{Y})$ can be regarded as a bounded open subset of a Euclidean space, and so the product $\mathbf{X}\times\operatorname{Int}\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$ can also be regarded as a bounded open set $\Omega$ in a Euclidean space of dimension $K:=\dim\mathbf{X}+|\mathbf{Y}|-1$: namely, as a disjoint union of $|\mathbf{Y}|$ copies of the bounded open set $\mathbf{X}\times\operatorname{Int}\mathcal{P}(\mathbf{Y})$.

For a smooth function $u:\Omega\to\mathbb{R}$ and $m\in\{0,1,\ldots\}$ define

$$\|u\|_{H^m(\Omega)}:=\left(\sum_{0\le|\alpha|\le m}\int_\Omega\left(D^\alpha u\right)^2\right)^{1/2}, \tag{14}$$

where $\int_\Omega$ stands for the integral with respect to the Lebesgue measure on $\Omega$, $\alpha$ runs over the multi-indices $\alpha=(\alpha_1,\ldots,\alpha_K)\in\{0,1,\ldots\}^K$, and

$$|\alpha|:=\alpha_1+\cdots+\alpha_K,\qquad D^\alpha u:=\frac{\partial^{|\alpha|}u}{\partial t_1^{\alpha_1}\cdots\partial t_K^{\alpha_K}}$$

($(t_1,\ldots,t_K)$ is a typical point of the Euclidean space containing $\Omega$). Let $H^m(\Omega)$ be the completion of the set of smooth functions on $\Omega$ with respect to the norm (14). According to the Sobolev imbedding theorem ([1], Theorem 4.12), $H^m(\Omega)$ can be identified with an RKHS of continuous functions on the closure $\overline{\Omega}$ of $\Omega$ with a finite imbedding constant. This conclusion depends on the assumption $m>K/2$, which we will always be making.

It is clear that every continuous function $f$ on $\overline{\Omega}$ can be approximated, arbitrarily closely, by a function from $H^m(\Omega)$: even the functions in $C^\infty(\mathbb{R}^K)$, all of which belong to all Sobolev spaces on $\Omega$, are dense in $C(\overline{\Omega})$ ([1], 2.29).

There is little doubt that Sobolev spaces $H^m(\Omega)$ are continuous under our assumption $m>K/2$ and for "nice" $\Omega$, although I am not aware of any general results in this direction.

Proof of Theorem 4

If $f:\Omega\to\mathcal{H}$ is a function taking values in a topological vector space $\mathcal{H}$ and $P$ is a finite measure on its domain $\Omega$, the integral $\int_\Omega f\,dP$ will be understood in Pettis's ([33], Definition 3.26) sense. Namely, the integral $\int_\Omega f\,dP$ is defined to be $h\in\mathcal{H}$ such that

$$\Lambda h=\int_\Omega(\Lambda f)\,dP \tag{15}$$

for all $\Lambda\in\mathcal{H}^*$. The existence and uniqueness of the Pettis integral is assured if $\Omega$ is a compact topological space (with $P$ defined on its Borel $\sigma$-algebra), $\mathcal{H}$ is a Banach space, and $f$ is continuous ([33], Theorems 3.27, 3.20, and 3.3).

Remark 4 Another popular notion of the integral for vector-valued functions is Bochner's (see, e.g., [231), which is more restrictive than Pettis's (in particular, the Bochner integral always satisfies (15)). Interestingly, the Bochner integral $\int_\Omega f\,dP$ exists for all measurable functions $f:\Omega\to\mathcal{H}$ (with $\Omega$ a measurable space) provided $\mathcal{H}$ is a separable Banach space and $\int_\Omega\|f\|_{\mathcal{H}}\,dP<\infty$ (this follows from Bochner's theorem, [53], Theorem 1 in Section V.5, and Pettis's measurability theorem, [53], the theorem in Section V.4). No topological conditions are imposed on $\Omega$ or $f$, but there is the requirement of separability (which is essential, again by Bochner's theorem and Pettis's measurability theorem). This requirement, however, may be said to be satisfied automatically under the given sufficient conditions for the existence of the Pettis integral: since $f(\Omega)$ is a compact metric space, it is separable ([15], 4.1.18), and we can redefine $\mathcal{H}$ as the smallest closed linear subspace containing $f(\Omega)$. Therefore, we can use all properties of the Bochner integral under those conditions.

We start from a corollary (a version of Kolmogorov's 1929 result) of Lemma 



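Lemma 4 Suppose $\mathbf{Y}$ is a metric compact. Let $\Phi_n:\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}\to\mathcal{H}$, $n=1,2,\ldots$, be functions taking values in a Hilbert space $\mathcal{H}$ such that, for all $n$ and $x$, $\Phi_n(x,P,y)$ is a continuous function of $(P,y)\in\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$. There is a probability forecasting strategy that guarantees

$$\left\|\sum_{n=1}^{N}\Psi_n(x_n,P_n,y_n)\right\|_{\mathcal{H}}^2\le\sum_{n=1}^{N}\left\|\Psi_n(x_n,P_n,y_n)\right\|_{\mathcal{H}}^2 \tag{16}$$

for all $N$, where

$$\Psi_n(x,P,y):=\Phi_n(x,P,y)-\int_{\mathbf{Y}}\Phi_n(x,P,y)\,P(dy).$$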
Proof According to Lemma O it suffices to check that

$$S_N:=\left\|\sum_{n=1}^{N}\Psi_n(x_n,P_n,y_n)\right\|_{\mathcal{H}}^2-\sum_{n=1}^{N}\left\|\Psi_n(x_n,P_n,y_n)\right\|_{\mathcal{H}}^2 \tag{17}$$

is the capital process of some strategy for Skeptic in the defensive forecasting protocol. Since

$$\begin{aligned}
S_N-S_{N-1}&=\left\|\sum_{n=1}^{N-1}\Psi_n(x_n,P_n,y_n)+\Psi_N(x_N,P_N,y_N)\right\|_{\mathcal{H}}^2-\left\|\sum_{n=1}^{N-1}\Psi_n(x_n,P_n,y_n)\right\|_{\mathcal{H}}^2-\left\|\Psi_N(x_N,P_N,y_N)\right\|_{\mathcal{H}}^2\\
&=\left\langle 2\sum_{n=1}^{N-1}\Psi_n(x_n,P_n,y_n),\ \Psi_N(x_N,P_N,y_N)\right\rangle_{\mathcal{H}}\\
&=\left\langle A,\Psi_N(x_N,P_N,y_N)\right\rangle_{\mathcal{H}},
\end{aligned}$$

where we have introduced the notation $A$ for the element $2\sum_{n=1}^{N-1}\Psi_n(x_n,P_n,y_n)$ of $\mathcal{H}$, known at the beginning of the $N$th round, and, by the definition of the Pettis integral,

$$\int_{\mathbf{Y}}\left\langle A,\Psi_N(x_N,P_N,y)\right\rangle_{\mathcal{H}}P_N(dy)=\left\langle A,\int_{\mathbf{Y}}\Psi_N(x_N,P_N,y)\,P_N(dy)\right\rangle_{\mathcal{H}}=0, \tag{18}$$

the difference $S_N-S_{N-1}$ coincides with Skeptic's gain in the $N$th round of the testing protocol when he makes the valid move $f_N(y):=\langle A,\Psi_N(x_N,P_N,y)\rangle_{\mathcal{H}}$. It remains to check that $F_N(y,P):=\langle A,\Psi_N(x_N,P,y)\rangle_{\mathcal{H}}$ will be a valid move in the defensive forecasting protocol, i.e., that the function $F_N$ is lower semicontinuous; we will see that it is in fact continuous. By Lemma 5 below, the function $\int_{\mathbf{Y}}\Phi_N(x,P,y)\,P(dy)$ is continuous in $P$; therefore, the function $\Psi_N$ is continuous in $(P,y)$. This implies that $\langle A,\Psi_N(x_N,P,y)\rangle_{\mathcal{H}}$ is a continuous function of $(P,y)$. ∎
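
To see the mechanism of this proof at work, consider a binary special case that is not spelled out in the text: $\mathbf{Y}=\{0,1\}$, a forecast identified with $p=P(\{1\})$, and $\Phi(x,p,y):=y\,\mathbf{k}_{(x,p)}$ for a kernel $\mathbf{k}$ on pairs $(x,p)$, so that $\Psi(x,p,y)=(y-p)\mathbf{k}_{(x,p)}$ and Skeptic's gain is $2(y-p)S(p)$ with $S(p):=\sum_{i<N}\mathbf{k}((x_i,p_i),(x_N,p))(y_i-p_i)$. Forecaster can make this gain non-positive by choosing a root of $S$ (or an endpoint of $[0,1]$ when $S$ has a constant sign). The sketch below is an illustration under these assumptions; the Gaussian kernel, its parametrization, the bisection search and all function names are choices made here, not taken from the paper.

```python
import math


def gaussian_kernel(z1, z2, sigma=1.0):
    """Gaussian kernel on pairs z = (x, p); the parametrization is an assumption."""
    d2 = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-d2 / sigma ** 2)


def defensive_forecast(history, x, kernel=gaussian_kernel, iters=60):
    """Choose p in [0, 1] so that Skeptic's gain 2 * (y - p) * S(p) is <= 0 for y in {0, 1}."""
    if not history:
        return 0.5                      # S is identically zero, any p works

    def S(p):
        return sum(kernel((xi, pi), (x, p)) * (yi - pi) for xi, pi, yi in history)

    if S(1.0) >= 0:
        return 1.0                      # gain is (y - 1) * S(1) <= 0
    if S(0.0) <= 0:
        return 0.0                      # gain is y * S(0) <= 0
    lo, hi = 0.0, 1.0                   # now S(lo) > 0 > S(hi): bisect for a root
    for _ in range(iters):
        mid = (lo + hi) / 2
        if S(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def play(data, outcomes):
    """Run the forecaster against a fixed sequence of data and outcomes."""
    history = []
    for x, y in zip(data, outcomes):
        p = defensive_forecast(history, x)
        history.append((x, p, y))
    return history


def squared_norm_of_sum(history, kernel=gaussian_kernel):
    """|| sum_n (y_n - p_n) k_{(x_n, p_n)} ||^2, the left-hand side of (16)."""
    return sum((yi - pi) * (yj - pj) * kernel((xi, pi), (xj, pj))
               for xi, pi, yi in history for xj, pj, yj in history)


def sum_of_squared_norms(history, kernel=gaussian_kernel):
    """sum_n || (y_n - p_n) k_{(x_n, p_n)} ||^2, the right-hand side of (16)."""
    return sum((y - p) ** 2 * kernel((x, p), (x, p)) for x, p, y in history)


parities = [n % 2 for n in range(1, 41)]
history = play(parities, parities)      # datum = parity of n, outcome = 1, 0, 1, 0, ...
```

Up to the numerical error of the root search, `squared_norm_of_sum(history)` does not exceed `sum_of_squared_norms(history)`, whatever sequence of outcomes is fed to `play`; this is inequality (16) for this choice of $\Phi$.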

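The proof of Lemma 4 used the following lemma.

Lemma 5 Suppose $\mathbf{Y}$ is a metric compact and $\Phi:\mathcal{P}(\mathbf{Y})\times\mathbf{Y}\to\mathcal{H}$ is a continuous mapping into a Hilbert space $\mathcal{H}$. The mapping $P\in\mathcal{P}(\mathbf{Y})\mapsto\int_{\mathbf{Y}}\Phi(P,y)\,P(dy)$ is also continuous.

Proof Let $P_n\to P$ as $n\to\infty$; our goal is to prove that $\int_{\mathbf{Y}}\Phi(P_n,y)\,P_n(dy)\to\int_{\mathbf{Y}}\Phi(P,y)\,P(dy)$. We have:

$$\begin{aligned}
\left\|\int_{\mathbf{Y}}\Phi(P_n,y)\,P_n(dy)-\int_{\mathbf{Y}}\Phi(P,y)\,P(dy)\right\|_{\mathcal{H}}
&\le\left\|\int_{\mathbf{Y}}\Phi(P_n,y)\,P_n(dy)-\int_{\mathbf{Y}}\Phi(P,y)\,P_n(dy)\right\|_{\mathcal{H}}\\
&\quad+\left\|\int_{\mathbf{Y}}\Phi(P,y)\,P_n(dy)-\int_{\mathbf{Y}}\Phi(P,y)\,P(dy)\right\|_{\mathcal{H}}.
\end{aligned} \tag{19}$$

The first addend on the right-hand side can be bounded above by

$$\int_{\mathbf{Y}}\left\|\Phi(P_n,y)-\Phi(P,y)\right\|_{\mathcal{H}}P_n(dy)$$

([33], 3.29), and the last expression tends to zero since $\Phi$ is uniformly continuous ([15], 4.3.32). The second addend on the right-hand side of (19) tends to zero by the continuity of the mapping $Q\in\mathcal{P}(\mathbf{Y})\mapsto\int_{\mathbf{Y}}f(y)\,Q(dy)$ for a continuous $f$ (0, III.4.2, Proposition 6). ∎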

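The following variation on Lemma 5 will be needed later.

Lemma 6 Suppose $\mathbf{X}$ and $\mathbf{Y}$ are metric compacts and $\Phi:\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}\to\mathcal{H}$ is a continuous mapping into a Hilbert space $\mathcal{H}$. The mapping $(x,P)\in\mathbf{X}\times\mathcal{P}(\mathbf{Y})\mapsto\int_{\mathbf{Y}}\Phi(x,P,y)\,P(dy)$ is also continuous.

Proof Let $x_n\to x$ and $P_n\to P$ as $n\to\infty$. To prove $\int_{\mathbf{Y}}\Phi(x_n,P_n,y)\,P_n(dy)\to\int_{\mathbf{Y}}\Phi(x,P,y)\,P(dy)$ we can use a similar argument to that in the previous lemma applied to

$$\begin{aligned}
\left\|\int_{\mathbf{Y}}\Phi(x_n,P_n,y)\,P_n(dy)-\int_{\mathbf{Y}}\Phi(x,P,y)\,P(dy)\right\|_{\mathcal{H}}
&\le\left\|\int_{\mathbf{Y}}\Phi(x_n,P_n,y)\,P_n(dy)-\int_{\mathbf{Y}}\Phi(x,P,y)\,P_n(dy)\right\|_{\mathcal{H}}\\
&\quad+\left\|\int_{\mathbf{Y}}\Phi(x,P,y)\,P_n(dy)-\int_{\mathbf{Y}}\Phi(x,P,y)\,P(dy)\right\|_{\mathcal{H}}.
\end{aligned}$$

∎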

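Now we can begin the actual proof of Theorem 4. Take as $\Phi(x,P,y)$ the representer $\mathbf{k}_{x,P,y}$ of the evaluation functional $f\in\mathcal{F}\mapsto f(x,P,y)$:

$$\langle f,\mathbf{k}_{x,P,y}\rangle_{\mathcal{F}}=f(x,P,y),\quad\forall(x,P,y)\in\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y},\ f\in\mathcal{F}.$$

Set

$$\mathbf{k}_{x,P}:=\int_{\mathbf{Y}}\mathbf{k}_{x,P,y}\,P(dy);$$

the function $\mathbf{k}_{x,P}$ is continuous in $P$ by Lemma 5.

Theorem 4 will easily follow from the following lemma, which itself is an easy implication of Lemma 4.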

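Lemma 7 Let $\mathbf{Y}$ be a metric compact and $\mathcal{F}$ be a forecast-continuous RKHS on $\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$. There is a probability forecasting strategy that guarantees

$$\left|\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)\right|\le\|f\|_{\mathcal{F}}\sqrt{\sum_{n=1}^{N}\left\|\mathbf{k}_{x_n,P_n,y_n}-\mathbf{k}_{x_n,P_n}\right\|_{\mathcal{F}}^2}$$

for all $N$ and all $f\in\mathcal{F}$.

Proof Using Lemma 4 (with all $\Psi_n$ equal, $\Psi_n(x,P,y):=\mathbf{k}_{x,P,y}-\mathbf{k}_{x,P}$), we obtain:

$$\begin{aligned}
\left|\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)\right|
&=\left|\sum_{n=1}^{N}\left\langle f,\ \mathbf{k}_{x_n,P_n,y_n}-\int_{\mathbf{Y}}\mathbf{k}_{x_n,P_n,y}\,P_n(dy)\right\rangle_{\mathcal{F}}\right|\\
&=\left|\left\langle f,\ \sum_{n=1}^{N}\left(\mathbf{k}_{x_n,P_n,y_n}-\mathbf{k}_{x_n,P_n}\right)\right\rangle_{\mathcal{F}}\right|\\
&\le\|f\|_{\mathcal{F}}\left\|\sum_{n=1}^{N}\left(\mathbf{k}_{x_n,P_n,y_n}-\mathbf{k}_{x_n,P_n}\right)\right\|_{\mathcal{F}}\\
&\le\|f\|_{\mathcal{F}}\sqrt{\sum_{n=1}^{N}\left\|\mathbf{k}_{x_n,P_n,y_n}-\mathbf{k}_{x_n,P_n}\right\|_{\mathcal{F}}^2}.
\end{aligned}$$

∎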

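Remark 5 The algorithm of Lemma 7 is a generalization of the K29 algorithm of [HO]. It would be interesting also to analyze the K29* algorithm (called the algorithm of large numbers in [47] and jJHl).

To deduce Theorem 4 from Lemma 7 notice that $\|\mathbf{k}_{x,P,y}\|_{\mathcal{F}}\le c_{\mathcal{F}}$ (by Lemma 8 below), $\|\mathbf{k}_{x,P}\|_{\mathcal{F}}\le\int_{\mathbf{Y}}\|\mathbf{k}_{x,P,y}\|_{\mathcal{F}}\,P(dy)\le c_{\mathcal{F}}$, and, therefore,

$$\sum_{n=1}^{N}\left\|\mathbf{k}_{x_n,P_n,y_n}-\mathbf{k}_{x_n,P_n}\right\|_{\mathcal{F}}^2\le 4c_{\mathcal{F}}^2N.$$

This completes the proof apart from Lemma 8.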

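Let $\mathcal{F}$ be an RKHS on $\Omega$. The norm of the evaluation functional $f\in\mathcal{F}\mapsto f(\omega)$ will be denoted by $c_{\mathcal{F}}(\omega)$. It is clear that $\mathcal{F}$ is an RKHS with finite imbedding constant if and only if

$$c_{\mathcal{F}}:=\sup_{\omega\in\Omega}c_{\mathcal{F}}(\omega) \tag{20}$$

is finite; the constants in (20) and (12) coincide. The next lemma, concluding the proof of Theorem 4, asserts that the norm $\|\mathbf{k}_\omega\|_{\mathcal{F}}$ of the representer of $\omega$ in $\mathcal{F}$ coincides with the norm $c_{\mathcal{F}}(\omega)$ of the evaluation functional $f\mapsto f(\omega)$.

Lemma 8 Let $\mathcal{F}$ be an RKHS on $\Omega$. For each $\omega\in\Omega$,

$$\|\mathbf{k}_\omega\|_{\mathcal{F}}=c_{\mathcal{F}}(\omega). \tag{21}$$

Proof Fix $\omega\in\Omega$. We are required to prove

$$\sup_{f:\,\|f\|_{\mathcal{F}}\le 1}|f(\omega)|=\|\mathbf{k}_\omega\|_{\mathcal{F}}.$$

The inequality $\le$ follows from

$$|f(\omega)|=\left|\langle f,\mathbf{k}_\omega\rangle_{\mathcal{F}}\right|\le\|f\|_{\mathcal{F}}\|\mathbf{k}_\omega\|_{\mathcal{F}}\le\|\mathbf{k}_\omega\|_{\mathcal{F}},$$

where $\|f\|_{\mathcal{F}}\le 1$. The inequality $\ge$ follows from

$$|f(\omega)|=\frac{\left|\langle\mathbf{k}_\omega,\mathbf{k}_\omega\rangle_{\mathcal{F}}\right|}{\|\mathbf{k}_\omega\|_{\mathcal{F}}}=\frac{\|\mathbf{k}_\omega\|_{\mathcal{F}}^2}{\|\mathbf{k}_\omega\|_{\mathcal{F}}}=\|\mathbf{k}_\omega\|_{\mathcal{F}},$$

where $f:=\mathbf{k}_\omega/\|\mathbf{k}_\omega\|_{\mathcal{F}}$ and $\|\mathbf{k}_\omega\|_{\mathcal{F}}$ is assumed to be non-zero (if it is zero, $\mathbf{k}_\omega=0$, which implies $c_{\mathcal{F}}(\omega)=0$, and (21) still holds). ∎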



Reproducing kernels 

In this subsection we start preparations for proving Theorem 3. But first we need to delve slightly deeper into the theory of RKHS. An equivalent language for talking about RKHS is provided by the notion of a reproducing kernel, and this subsection defines reproducing kernels and summarizes some of their properties. For a detailed discussion, see, e.g., [310] or pS].

The reproducing kernel of an RKHS $\mathcal{F}$ on $\Omega$ is the function $\mathbf{k}:\Omega^2\to\mathbb{R}$ defined by

$$\mathbf{k}(\omega,\omega'):=\langle\mathbf{k}_\omega,\mathbf{k}_{\omega'}\rangle_{\mathcal{F}}$$

(equivalently, we could define $\mathbf{k}(\omega,\omega')$ as $\mathbf{k}_\omega(\omega')$ or as $\mathbf{k}_{\omega'}(\omega)$). The origin of this name is the "reproducing property" (13).

There is a simple internal characterization of reproducing kernels of RKHS. First, it is easy to check that the function $\mathbf{k}(\omega,\omega')$, as we defined it, is symmetric,

$$\mathbf{k}(\omega,\omega')=\mathbf{k}(\omega',\omega),\quad\forall(\omega,\omega')\in\Omega^2,$$

and positive definite,

$$\sum_{i=1}^{m}\sum_{j=1}^{m}t_it_j\,\mathbf{k}(\omega_i,\omega_j)\ge 0,\quad\forall m=1,2,\ldots,\ (t_1,\ldots,t_m)\in\mathbb{R}^m,\ (\omega_1,\ldots,\omega_m)\in\Omega^m.$$

On the other hand, for every symmetric and positive definite $\mathbf{k}:\Omega^2\to\mathbb{R}$ there exists a unique RKHS $\mathcal{F}$ on $\Omega$ such that $\mathbf{k}$ is the reproducing kernel of $\mathcal{F}$ ([3], Theorem 2 on p. 143).
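
The two conditions can be probed numerically. The following sketch is illustrative only: it evaluates the quadratic form $\sum_{i,j}t_it_j\mathbf{k}(\omega_i,\omega_j)$ on the real line for a Gaussian kernel (with a parametrization assumed here) and for a symmetric function that is not positive definite. A finite random search can refute positive definiteness but of course cannot prove it.

```python
import math
import random


def gaussian_kernel(u, v, sigma=1.0):
    """A Gaussian kernel on the real line (parametrization assumed for illustration)."""
    return math.exp(-((u - v) ** 2) / sigma ** 2)


def negative_distance(u, v):
    """Symmetric, but not positive definite: used as a counterexample."""
    return -abs(u - v)


def quadratic_form(kernel, points, coeffs):
    """sum_i sum_j t_i t_j k(w_i, w_j) for points w and coefficients t."""
    return sum(ti * tj * kernel(wi, wj)
               for ti, wi in zip(coeffs, points)
               for tj, wj in zip(coeffs, points))


rng = random.Random(0)
gaussian_values = []
for _ in range(200):
    m = rng.randint(1, 6)
    points = [rng.uniform(-3, 3) for _ in range(m)]
    coeffs = [rng.uniform(-1, 1) for _ in range(m)]
    gaussian_values.append(quadratic_form(gaussian_kernel, points, coeffs))

# t = (1, 1) on the points 0 and 1 gives 0 - 1 - 1 + 0 = -2 < 0.
counterexample = quadratic_form(negative_distance, [0.0, 1.0], [1.0, 1.0])
```

All values in `gaussian_values` are non-negative up to rounding, while `counterexample` is negative, so `negative_distance` cannot be the reproducing kernel of any RKHS.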

We can see that the notions of a reproducing kernel of RKHS and of a symmetric positive definite function on $\Omega^2$ have the same content, and we will sometimes say "kernel on $\Omega$" to mean a symmetric positive definite function on $\Omega^2$. Kernels in this sense are the main source of RKHS in learning theory: cf. 021 HHI'. Every kernel on X is a valid parameter for our prediction algorithms. In general, it is convenient to use RKHS in stating mathematical properties of prediction algorithms, but the algorithms themselves typically use the more constructive representation of RKHS via their reproducing kernels.

It is easy to see that $\mathcal{F}$ is a continuous RKHS if and only if its reproducing kernel is continuous (see 001 or Appendix B of the arXiv technical report). A convenient equivalent definition of $c_{\mathcal{F}}$ is

$$c_{\mathcal{F}}=c_{\mathbf{k}}:=\sup_{\omega\in\Omega}\sqrt{\mathbf{k}(\omega,\omega)}=\sup_{\omega,\omega'\in\Omega}\sqrt{\left|\mathbf{k}(\omega,\omega')\right|}, \tag{22}$$

$\mathbf{k}$ being the reproducing kernel of an RKHS $\mathcal{F}$ on $\Omega$.

Let us say that a family $\mathcal{F}$ of functions $f:\Omega\to\mathbb{R}$ is universal if $\Omega$ is a topological space and for every compact subset $A$ of $\Omega$ every continuous function on $A$ can be arbitrarily well approximated in the metric $C(A)$ by functions in $\mathcal{F}$ (in the case of compact $\Omega$ this coincides with the definition given in jJU] as Definition 4).

We have already noticed the obvious fact that the Sobolev spaces $H^m(\Omega)$ on bounded open $\Omega\subseteq\mathbb{R}^K$, $K<2m$, are universal. There is a price to pay for the obviousness of this fact: the reproducing kernels of the Sobolev spaces are known only in some special cases (see, e.g., 0, Section 7.4). This complicates checking their continuity.

On the other hand, some very simple continuous reproducing kernels, such as the Gaussian kernel



($\|\cdot\|$ being the Euclidean norm and $\sigma$ being an arbitrary positive constant) on the Euclidean space $\mathbb{R}^K$ and the infinite polynomial kernel



($\langle\cdot,\cdot\rangle$ being the Euclidean inner product) on the Euclidean ball $\{\omega\in\mathbb{R}^K\mid\|\omega\|<1\}$, are universal (^Oli Examples 1 and 2). Their universality is not difficult to prove but not obvious (and even somewhat counterintuitive in the case of the Gaussian kernel: a priori one might expect that only smooth functions that are almost linear at scales smaller than $\sigma$ can belong to the corresponding RKHS). On the other hand, their continuity is obvious.

Universal function space on the Hilbert cube

Remember that the Hilbert cube is the topological space $[0,1]^\infty$ ([15], 2.3.22), i.e., the topological product of a countable number of closed intervals $[0,1]$. As the next step in the proof of Theorem 3, in this subsection we construct a universal RKHS on the Hilbert cube with finite imbedding constant; the idea of the construction is to "mix" Sobolev spaces on $[0,1]^K$ for $K=1,2,\ldots$ (or the spaces mentioned at the end of the previous subsection, for which both continuity and universality are proven).

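Let $\mathcal{F}_K$, $K=1,2,\ldots$, be the set of all functions $f$ on the Hilbert cube such that $f(t_1,t_2,\ldots)$ depends only on $t_1,\ldots,t_K$ and whose norm (14) (with $\Omega:=[0,1]^K$) is finite for $m:=K$. Equipping $\mathcal{F}_K$ with this norm we obtain an RKHS with finite imbedding constant. Let $c_K$ be the imbedding constant of $\mathcal{F}_K$. It will be convenient to modify each $\mathcal{F}_K$ by scaling the inner product:

$$\langle\cdot,\cdot\rangle_{\mathcal{F}'_K}:=c_K^2\,2^K\langle\cdot,\cdot\rangle_{\mathcal{F}_K};$$

the scaled $\mathcal{F}_K$ will be denoted $\mathcal{F}'_K$. By (13), the representer $\mathbf{k}'_\omega$ of $\omega$ in $\mathcal{F}'_K$ can be expressed as $\mathbf{k}'_\omega=c_K^{-2}2^{-K}\mathbf{k}_\omega$ via the representer $\mathbf{k}_\omega$ of $\omega$ in $\mathcal{F}_K$. Therefore, the imbedding constant of $\mathcal{F}'_K$ is $2^{-K/2}$, and it is obvious that $\mathcal{F}'_K$ inherits from $\mathcal{F}_K$ the property of being a universal RKHS for functions that only depend on $t_1,\ldots,t_K$.

For the reproducing kernel $\mathbf{k}'_K(\omega,\omega')$ of $\mathcal{F}'_K$ we have

$$\left|\mathbf{k}'_K(\omega,\omega')\right|=\left|\langle\mathbf{k}'_\omega,\mathbf{k}'_{\omega'}\rangle_{\mathcal{F}'_K}\right|\le\|\mathbf{k}'_\omega\|_{\mathcal{F}'_K}\|\mathbf{k}'_{\omega'}\|_{\mathcal{F}'_K}\le 2^{-K/2}2^{-K/2}=2^{-K},$$

where $\mathbf{k}'_\omega$ and $\mathbf{k}'_{\omega'}$ stand for the representers in $\mathcal{F}'_K$. Define an RKHS $\mathcal{G}_K$ as the set of all functions $f:[0,1]^\infty\to\mathbb{R}$ that can be decomposed into a sum $f=f_1+\cdots+f_K$, where $f_k\in\mathcal{F}'_k$, $k=1,\ldots,K$. The norm of $f$ is defined as the infimum

$$\|f\|_{\mathcal{G}_K}:=\inf\sqrt{\sum_{k=1}^{K}\|f_k\|_{\mathcal{F}'_k}^2}$$

over all such decompositions. According to the theorem on p. 353 of [4], $\mathcal{G}_K$ is an RKHS whose reproducing kernel $\mathbf{k}_K$ satisfies

$$\mathbf{k}_K(\omega,\omega')=\sum_{k=1}^{K}\mathbf{k}'_k(\omega,\omega')\in\left[-1+2^{-K},\,1-2^{-K}\right].$$

The limiting RKHS of $\mathcal{G}_K$, $K\to\infty$, is defined in [4], Section I.9 (Case B), in two steps. Let $\mathcal{F}_0$ consist of the functions in $\mathcal{G}_K$, $K=1,2,\ldots$; the $\mathcal{F}_0$-norm of a function $g\in\mathcal{G}_K$ is defined as

$$\|g\|_{\mathcal{F}_0}:=\lim_{k\to\infty}\|g\|_{\mathcal{G}_k}.$$

In general, the space $\mathcal{F}_0$ is not complete. Therefore, a larger space $\mathcal{F}_0^*$ is defined: $f\in\mathcal{F}_0^*$ if there is a Cauchy sequence $f_n$ in $\mathcal{F}_0$ such that

$$\forall\omega\in[0,1]^\infty:\ f(\omega)=\lim_{n\to\infty}f_n(\omega); \tag{23}$$

the norm of such an $f$ is defined as

$$\|f\|_{\mathcal{F}_0^*}:=\inf\lim_{n\to\infty}\|f_n\|_{\mathcal{F}_0},$$

where the infimum is taken over all Cauchy sequences satisfying (23). By Theorem II on p. 367 of [4], $\mathcal{F}_0^*$ is an RKHS with reproducing kernel

$$\mathbf{k}^*(\omega,\omega')=\sum_{k=1}^{\infty}\mathbf{k}'_k(\omega,\omega')\in[-1,1]; \tag{24}$$

therefore, its imbedding constant is finite (at most 1: see (22)).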

Lemma 9 The RKHS $\mathcal{F}_0^*$ on the Hilbert cube is universal and continuous.

Proof The Hilbert cube is a topological space that is both compact (by Tikhonov's theorem, [15], 3.2.4) and metrizable; for concreteness, let us fix the metric

$$\rho\left((t_1,t_2,\ldots),(t'_1,t'_2,\ldots)\right):=\sum_{k=1}^{\infty}2^{-k}\left|t_k-t'_k\right|.$$

Let $f$ be a continuous function on the Hilbert cube. Since every continuous function on a compact metric space is uniformly continuous ([15], 4.3.32), the function

$$g(t_1,t_2,\ldots):=f(t_1,\ldots,t_K,0,0,\ldots)$$

can be made arbitrarily close to $f$, in metric $C([0,1]^\infty)$, by making $K$ sufficiently large. It remains to notice that $g$ can be arbitrarily closely approximated by a function in $\mathcal{F}'_K$ and that every function in $\mathcal{F}'_K$ belongs to $\mathcal{F}_0^*$.

The continuity of $\mathcal{F}_0^*$ follows from the Weierstrass M-test and the expression (24) of its reproducing kernel via the reproducing kernels of the spaces $\mathcal{F}'_K$, $K=1,2,\ldots$, with imbedding constant $2^{-K/2}$. ∎

Corollary 1 For any compact metric space $\Omega$ there is a continuous universal RKHS $\mathcal{F}$ on $\Omega$ with finite imbedding constant.

Proof It is known ([15], 4.2.10) that every compact metric space can be homeomorphically imbedded into the Hilbert cube; let $F:\Omega\to[0,1]^\infty$ be such an imbedding. The image $F(\Omega)$ is a compact subset of the Hilbert cube ([15], 3.1.10). Let $\mathcal{F}$ be the class of all functions $f:\Omega\to\mathbb{R}$ such that $f(F^{-1}):F(\Omega)\to\mathbb{R}$ is the restriction of a function in $\mathcal{F}_0^*$ to $F(\Omega)$; the norm of $f$ is defined as the infimum of the norms of the extensions of $f(F^{-1})$ to the whole of the Hilbert cube. According to the theorem on p. 351 of [4], this function space is an RKHS whose reproducing kernel is $\mathbf{k}(\omega,\omega'):=\mathbf{k}^*(F(\omega),F(\omega'))$, where $\mathbf{k}^*$ is the reproducing kernel of $\mathcal{F}_0^*$; we can see that $\mathcal{F}$ is a continuous RKHS with finite imbedding constant.

Let us see that the RKHS $\mathcal{F}$ is universal. Take any continuous function $g:\Omega\to\mathbb{R}$. By the Tietze-Uryson theorem ([15], 2.1.8), $g(F^{-1}):F(\Omega)\to\mathbb{R}$ can be extended to a continuous function $g_1$ on $[0,1]^\infty$. Let $g_2\in\mathcal{F}_0^*$ be a function that is close to $g_1$ in the $C([0,1]^\infty)$ norm. Then $g_2(F):\Omega\to\mathbb{R}$ will belong to $\mathcal{F}$ and will be close to $g$ in the $C(\Omega)$ norm. ∎
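Proof of Theorem 3

We start by proving the theorem under the assumption that $\mathbf{X}$ and $\mathbf{Y}$ are compact metric spaces. As explained above, in this case $\mathcal{P}(\mathbf{Y})$ is also compact and metrizable; therefore, $\Omega:=\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$ is also compact and metrizable. Let $f$ be a continuous real-valued function on $\Omega$; our goal is to establish the consequent of (6).

Let $\mathcal{F}$ be a universal and continuous RKHS on $\Omega$ with finite imbedding constant (cf. Corollary 1). If $g\in\mathcal{F}$ is at a distance at most $\epsilon$ from $f$ in the $C(\Omega)$ metric, we obtain from Theorem 4:

$$\begin{aligned}
\limsup_{N\to\infty}\left|\frac{1}{N}\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)\right|
&\le\limsup_{N\to\infty}\left|\frac{1}{N}\sum_{n=1}^{N}\left(g(x_n,P_n,y_n)-\int_{\mathbf{Y}}g(x_n,P_n,y)\,P_n(dy)\right)\right|+2\epsilon\\
&=2\epsilon.
\end{aligned} \tag{25}$$

Since this can be done for any $\epsilon>0$, the proof for the case of compact $\mathbf{X}$ and $\mathbf{Y}$ is complete.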

The rest of the proof is based on the following game (an abstract version of the "doubling trick", |^) played in a topological space $X$:

Game of removal $G(X)$

FOR $n = 1, 2, \ldots$:
  Remover announces compact $K_n\subseteq X$.
  Evader announces $p_n\notin K_n$.
END FOR.

Winner: Evader if the set $\{p_1,p_2,\ldots\}$ is precompact; Remover otherwise.

Intuitively, the goal of Evader is to avoid being removed to infinity. Without loss of generality we will assume that Remover always announces a non-decreasing sequence of compact sets: $K_1\subseteq K_2\subseteq\cdots$.

Lemma 10 (Gruenhage) Remover has a winning strategy in $G(X)$ if $X$ is a locally compact and paracompact space.

Proof We will follow the proof of Theorem 4.1 in ^Hj (the easy direction). If $X$ is locally compact and $\sigma$-compact, there exists a non-decreasing sequence $K_1\subseteq K_2\subseteq\cdots$ of compact sets covering $X$, and each $K_n$ can be extended to compact $K_n^*$ so that $\operatorname{Int}K_n^*\supseteq K_n$ ([15], 3.3.2). Remover will obviously win $G(X)$ choosing $K_1^*,K_2^*,\ldots$ as his moves.

If $X$ is the sum of locally compact $\sigma$-compact spaces $X_s$, $s\in S$, Remover plays, for each $s\in S$, the strategy described in the previous paragraph on the subsequence of Evader's moves belonging to $X_s$. If Evader chooses $p_n\in X_s$ for infinitely many $X_s$, those $X_s$ will form an open cover of the closure of $\{p_1,p_2,\ldots\}$ without a finite subcover. If the $p_n$ are chosen from only finitely many $X_s$, there will be infinitely many $p_n$ chosen from some $X_s$, and the result of the previous paragraph can be applied. It remains to remember that each locally compact paracompact space can be represented as the sum of locally compact $\sigma$-compact subsets ([15], 5.1.27). ∎

Now it is easy to prove the general theorem. Forecaster's strategy ensuring (6) will be constructed from his strategies $\mathcal{S}(A,B)$ ensuring the consequent of (6) under the condition $\forall n:(x_n,y_n)\in A\times B$ for given compact sets $A\subseteq\mathbf{X}$ and $B\subseteq\mathbf{Y}$ and from Remover's winning strategy in $G(\mathbf{X}\times\mathbf{Y})$ (remember that, by Stone's theorem, [15], 5.1.3, all metric spaces are paracompact and that the product of two locally compact spaces is locally compact, [15], 3.3.13; therefore, Lemma 10 is applicable to $G(\mathbf{X}\times\mathbf{Y})$). Without loss of generality we assume that Remover's moves are always of the form $A\times B$ for $A\subseteq\mathbf{X}$ and $B\subseteq\mathbf{Y}$. Forecaster will be playing two games in parallel: the probability forecasting game and the auxiliary game of removal $G(\mathbf{X}\times\mathbf{Y})$ (in the role of Evader).

Forecaster asks Remover to make his first move $A_1\times B_1$ in the game of removal. He then plays the probability forecasting game using the strategy $\mathcal{S}(A_1,B_1)$ until Reality chooses $(x_n,y_n)\notin A_1\times B_1$ (forever if Reality never chooses such $(x_n,y_n)$). As soon as such $(x_n,y_n)$ is chosen, Forecaster, in his Evader hat, announces $(x_n,y_n)$ and notes Remover's move $(A_2,B_2)$. He then plays the probability forecasting game using the strategy $\mathcal{S}(A_2,B_2)$ until Reality chooses $(x_n,y_n)\notin A_2\times B_2$, etc.

Let us check that this strategy for Forecaster will always ensure (6). If Reality chooses $(x_n,y_n)$ outside Forecaster's current $A_k\times B_k$ finitely often, the consequent of (6) will be satisfied. If Reality chooses $(x_n,y_n)$ outside Forecaster's current $A_k\times B_k$ infinitely often, the set $\{(x_n,y_n)\mid n=1,2,\ldots\}$ will not be precompact, and so the antecedent of (6) will be violated.
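
The restart construction can be sketched in code for the special case $\mathbf{X}=\mathbf{Y}=\mathbb{R}$. Everything specific below is an assumption made for illustration: Remover's $k$th move is taken to be the square $[-k,k]\times[-k,k]$, and `make_strategy` is a hypothetical factory standing in for the strategies $\mathcal{S}(A,B)$ (here a stub returning a constant forecast).

```python
def run_with_restarts(pairs, make_strategy):
    """Play with the strategy for Remover's current set, restarting on each exit.

    Remover's k-th move is assumed to be the square [-k, k] x [-k, k] in R x R.
    Returns the forecasts made and the number of moves made in the Evader role.
    """
    k = 1                                   # index of Remover's current move
    strategy = make_strategy(k)
    forecasts, restarts = [], 0
    for x, y in pairs:
        forecasts.append(strategy(x))       # forecast is made before y is seen
        if max(abs(x), abs(y)) > k:         # (x, y) lies outside the current set
            k += 1                          # Evader announces (x, y); Remover moves
            strategy = make_strategy(k)     # start a fresh strategy for the new set
            restarts += 1
    return forecasts, restarts


def make_uniform_strategy(radius):
    """Hypothetical stand-in for S(A, B): always forecasts probability 1/2."""
    return lambda x: 0.5


bounded = [(0.5, 0.5), (2.5, 0.0), (0.0, 2.5), (1.0, 1.0)]
unbounded = [(float(n), 0.0) for n in range(1, 11)]

_, restarts_bounded = run_with_restarts(bounded, make_uniform_strategy)
_, restarts_unbounded = run_with_restarts(unbounded, make_uniform_strategy)
```

For the bounded sequence the number of restarts stays finite (two here), so eventually a single strategy is used forever; the unbounded sequence forces a restart in almost every round, which corresponds to the case where the antecedent of (6) fails.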

5 Implications for probability theory 

This section is an aside; its results are not used in the rest of the paper. 

As we discussed at the end of Section 3, the procedure of defensive forecasting can be applied to virtually any law of probability (stated game-theoretically) to obtain a probability forecasting strategy whose forecasts are guaranteed to satisfy this law. Unfortunately, the standard laws of probability theory are often not strong enough to produce interesting probability forecasting strategies ((HO), Section 4.1). In particular, for the purpose of this paper it would be easiest to apply the procedure of defensive forecasting to a law of probability asserting that (6) holds for all continuous functions $f$ simultaneously with probability one. I am not aware of such results, but in the derivation of Theorem 3 we essentially proved one. In this section this result will be stated formally (as Theorem 5).

In general, it can be hoped that probability theory and competitive on-line prediction have the potential to enrich each other; not only can laws of probability be translated into probability forecasting strategies via defensive forecasting, but the needs of competitive on-line prediction can also help identify and fill gaps in the existing probability theory.

Game-theoretic result 

Let us say that Skeptic can force some property $E$ of the players' moves $x_n,P_n,y_n$, $n=1,2,\ldots$, in the testing protocol if he has a strategy guaranteeing that (1) his capital $\mathcal{K}_n$ is always non-negative, and (2) either $E$ is satisfied or $\lim_{n\to\infty}\mathcal{K}_n=\infty$. The properties that can be forced by Skeptic are the game-theoretic analogue of the properties that hold with probability one in measure-theoretic probability theory ([37], Section 8.1).

The following is a corollary from the proof (rather than the statement, which is why we also call it a theorem) of Theorem 3. Its interpretation is that the true probabilities have good calibration-cum-resolution.

Theorem 5 Suppose $\mathbf{X}$ and $\mathbf{Y}$ are locally compact metric spaces. Skeptic can force

$$\left(\{x_1,x_2,\ldots\}\text{ and }\{y_1,y_2,\ldots\}\text{ are precompact}\right)\Longrightarrow\forall f:\ \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)=0 \tag{26}$$

in the testing protocol, where $f$ ranges over all continuous functions $f:\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}\to\mathbb{R}$.

Proof of Theorem 5

We will follow the proof of Theorem 3, starting from an analogue of Lemma 4.

Lemma 11 Suppose $\mathbf{Y}$ is a metric compact. Let $\Phi:\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}\to\mathcal{H}$ be a function taking values in a Hilbert space $\mathcal{H}$ such that, for each $x$, $\Phi(x,P,y)$ is a continuous function of $(P,y)\in\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$. Suppose $\sup_{x,P,y}\|\Phi(x,P,y)\|_{\mathcal{H}}<\infty$ and set

$$\Psi(x,P,y):=\Phi(x,P,y)-\int_{\mathbf{Y}}\Phi(x,P,y)\,P(dy).$$

Skeptic can force

$$\left\|\sum_{n=1}^{N}\Psi(x_n,P_n,y_n)\right\|_{\mathcal{H}}=O\left(\sqrt{N}\log N\right) \tag{27}$$

as $N\to\infty$.

Proof Let

$$c:=\sup_{x,P,y}\|\Psi(x,P,y)\|_{\mathcal{H}}<\infty.$$

For $k,N=1,2,\ldots$, define

$$S_N^k:=\begin{cases}2^k+S_N&\text{if }c^2N\le 2^k\\S_{N-1}^k&\text{otherwise,}\end{cases}$$

where $S_N$ is defined as in (17) (with $\Psi$ in place of $\Psi_n$ in all references to the proof of Lemma 4). Let us check that

$$S_N^*:=\sum_{k=1}^{\infty}k^{-2}2^{-k}S_N^k \tag{28}$$

is a capital process (obviously non-negative) of a strategy for Skeptic started with a finite initial capital. Since $S_0^k=2^k$, the initial capital $\sum_{k=1}^{\infty}k^{-2}=\pi^2/6$ is indeed finite. It is also easy to see that the series (28) is convergent and that (18) still holds, where

$$A=\sum_{k=K}^{\infty}k^{-2}2^{-k}\,2\sum_{n=1}^{N-1}\Psi(x_n,P_n,y_n)$$

for some $K$.

Skeptic can force $S_N^*\le C$, where $C$ can depend on the path

$$x_1,P_1,y_1,x_2,P_2,y_2,\ldots$$

chosen by the players (see Lemma 3.1 in [37] or, for a simpler argument, the end of the proof of Theorem 3 in 021). Therefore, he can force $k^{-2}2^{-k}S_N^k\le C$ for all $k$. Setting $k:=\lceil\log(c^2N)\rceil$ (with $\log$ standing for the binary logarithm), we can rewrite the inequality $S_N^k\le Ck^22^k$ as

$$2^k+S_N\le Ck^22^k,$$

which implies

$$\left\|\sum_{n=1}^{N}\Psi(x_n,P_n,y_n)\right\|_{\mathcal{H}}^2\le Ck^22^k\le C\left(\log(c^2N)+1\right)^22^{\log(c^2N)+1}=O\left(N\log^2N\right).$$

∎
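
The arithmetic behind the mixture (28) is easy to check numerically. The sketch below is only a sanity check under assumed values of $c$, $N$ and $C$ (none of which come from the paper): it sums the weights $k^{-2}$ that make up the initial capital and evaluates the bound $Ck^22^k$ for $k=\lceil\log_2(c^2N)\rceil$. It assumes $c^2N>1$, so that $k\ge 1$.

```python
import math


def initial_capital(terms):
    """Partial sum of the initial capital sum_k k^{-2} 2^{-k} * 2^k = sum_k k^{-2}."""
    return sum(k ** -2 for k in range(1, terms + 1))


def stopping_level(c, N):
    """The index k = ceil(log2(c^2 N)) used at the end of the proof (needs c^2 N > 1)."""
    return math.ceil(math.log2(c * c * N))


def squared_norm_bound(c, N, C=1.0):
    """The bound C k^2 2^k on the squared norm of the sum of the Psi's."""
    k = stopping_level(c, N)
    return C * k * k * 2 ** k


partial = initial_capital(100_000)
k = stopping_level(1.0, 1000)               # log2(1000) is about 9.97
bound = squared_norm_bound(1.0, 1000)
coarse = (math.log2(1000) + 1) ** 2 * 2 ** (math.log2(1000) + 1)
```

Here `partial` is within $10^{-4}$ of $\pi^2/6$, and `bound` does not exceed `coarse`, the expression $C(\log(c^2N)+1)^2\,2^{\log(c^2N)+1}$ with $C=1$, which is what gives the $O(N\log^2N)$ rate.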



The following analogue of Theorem 4 immediately follows from Lemma 11 and the proof of Lemma 7.

Lemma 12 Let $\mathbf{Y}$ be a metric compact and $\mathcal{F}$ be a forecast-continuous RKHS on $\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}$ with finite imbedding constant. Skeptic can force

$$\left|\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)\right|=O\left(\sqrt{N}\log N\right)$$

as $N\to\infty$, where the $O$ is uniform in $f\in\mathcal{F}$.

In its turn Lemma 12 immediately implies the statement of Theorem 5 in the case of compact $\mathbf{X}$ and $\mathbf{Y}$ (where the antecedent of (26) is automatically true): we can use the same argument based on (25).

Now let $\mathbf{X}$ and $\mathbf{Y}$ be any locally compact metric spaces. Skeptic can use the same method based on Remover's winning strategy in the game of removal as that used by Forecaster in the proof of Theorem 3. This completes the proof of Theorem 5.

Measure-theoretic result 

In this subsection we will use some notions of measure-theoretic probability theory, such as regular conditional distributions; all needed background information can be found in, e.g., [39].

Corollary 2 Suppose $\mathcal{F}_n$, $n=0,1,\ldots$, is a filtration (increasing sequence of $\sigma$-algebras), $\mathbf{X}$ and $\mathbf{Y}$ are compact metric spaces, $x_n$, $n=1,2,\ldots$, are $\mathcal{F}_{n-1}$-measurable random elements taking values in $\mathbf{X}$, $y_n$, $n=1,2,\ldots$, are $\mathcal{F}_n$-measurable random elements taking values in $\mathbf{Y}$, and $P_n\in\mathcal{P}(\mathbf{Y})$ are regular conditional distributions of $y_n$ given $\mathcal{F}_{n-1}$. Then

$$\forall f:\ \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\left(f(x_n,P_n,y_n)-\int_{\mathbf{Y}}f(x_n,P_n,y)\,P_n(dy)\right)=0 \tag{29}$$

holds with probability one, where $f$ ranges over all continuous functions $f:\mathbf{X}\times\mathcal{P}(\mathbf{Y})\times\mathbf{Y}\to\mathbb{R}$.

Proof Since $\mathbf{X}$ and $\mathbf{Y}$ are automatically complete and separable, regular conditional distributions exist by the corollary of Theorem II.7.5 in [39]. Our derivation of Corollary 2 from Theorem 5 will follow the standard recipe ([37], Section 8.1).

Skeptic's strategy forcing (29) (i.e., the consequent of (26)) can be chosen measurable (in the sense that $f_n(y)$ is a measurable function of $y$ and the previous moves $x_1,P_1,\ldots,x_n,P_n$). This makes his capital process $\mathcal{K}_n$, $n=0,1,\ldots$, a martingale (in the usual measure-theoretic sense) with respect to the filtration $(\mathcal{F}_n)$. This martingale is non-negative and tends to infinity where (29) fails; standard results of probability theory (such as Doob's inequality, Theorem VII.3.1.III, or Doob's convergence theorem, Theorem VII.4.1) imply that (29) holds with probability one. ∎

6 Defensive forecasting for decision making: 
asymptotic theory 

Our D-prediction algorithms are built on top of probability forecasting
algorithms: D-predictions are found by minimizing the expected loss, with the
expectation taken with respect to the probability forecast. The first problem that
we have to deal with is the possibility that the minimizer of the expected loss will
be a discontinuous function, whereas continuity is essential for the method of
defensive forecasting (cf. Theorem O where $f$ has to be a continuous function).



Continuity of choice functions

It will be convenient to use the notation

$$\lambda(x, \gamma, P) := \int_Y \lambda(x, \gamma, y) \, P(dy),$$

where $P$ is a probability measure on $Y$. Let us say that
$G : X \times \mathcal{P}(Y) \to \Gamma$ is a (precise) choice function if it satisfies

$$\lambda(x, G(x, P), P) = \inf_{\gamma \in \Gamma} \lambda(x, \gamma, P), \quad \forall x \in X, \; P \in \mathcal{P}(Y).$$

As we said, a serious problem in implementing the expected loss minimization
principle is that there might not exist a continuous choice function $G$; this is true
even if $X$, $\Gamma$, and $Y$ are metric compacts and the loss function is continuous.
If, however, the loss function $\lambda(x, \gamma, y)$ is convex in $\gamma \in \Gamma$, there exists an
approximate choice function (although a precise choice function may still not
exist).

The simplest example of a prediction game is perhaps the simple prediction
game, in which there are no data, $\Gamma = Y = \{0, 1\}$ and $\lambda(\gamma, y) := |y - \gamma|$
(omitting the $x$s from our notation). There are no continuous approximate
choice functions in this case, since there are no non-trivial (taking more than
one value) continuous functions from the connected space $\mathcal{P}(Y)$ to $\Gamma$. If we
allow randomized predictions, the simple prediction game effectively transforms
into the following absolute loss game: $\Gamma = [0, 1]$, $Y = \{0, 1\}$, $\lambda(\gamma, y) := |y - \gamma|$.
Intuitively, the prediction $\gamma$ in this game is the bias of the coin tossed to choose
the prediction in the simple prediction game, and $|y - \gamma|$ is the expected loss in
the latter.

Unfortunately, there is still no continuous choice function in the absolute
loss game. It is easy to check that any choice function $G$ must satisfy

$$G(P) = \begin{cases} 1 & \text{if } P(\{1\}) > 1/2 \\ 0 & \text{if } P(\{1\}) < 1/2, \end{cases} \tag{30}$$

but the case $P(\{1\}) = 1/2$ is a point of bifurcation: both predictions $\gamma = 1$ and
$\gamma = 0$ are optimal, as indeed is every prediction in between. If $P(\{1\}) = 1/2$,
the predictor finds himself in a position of Buridan's ass: he has several equally
attractive decisions to choose from. It is clear that $G$ defined by (30) cannot be
continuously extended to the whole of $\mathcal{P}(\{0, 1\})$.

We have to look for approximate choice functions. Under natural compactness
and convexity conditions, they exist by the following lemma.

Lemma 13 Let $X$ be a paracompact, $Y$ be a non-empty compact convex subset
of a topological vector space, and $f : X \times Y \to \mathbb{R}$ be a continuous function such
that $f(x, y)$ is convex in $y \in Y$ for each $x \in X$. For any $\epsilon > 0$ there exists a
continuous "approximate choice function" $g : X \to Y$ such that

$$\forall x \in X : f(x, g(x)) \le \inf_{y \in Y} f(x, y) + \epsilon. \tag{31}$$

Proof Each $(x, y) \in X \times Y$ has a neighborhood $A_{x,y} \times B_{x,y}$ such that $A_{x,y}$ and
$B_{x,y}$ are open sets in $X$ and $Y$, respectively, and

$$\sup_{A_{x,y} \times B_{x,y}} f - \inf_{A_{x,y} \times B_{x,y}} f < \frac{\epsilon}{2}.$$

For each $x \in X$ choose a finite subcover of the cover
$\{A_{x,y} \times B_{x,y} \mid x \in A_{x,y}, \; y \in Y\}$ of $\{x\} \times Y$ and let $A_x$ be the
intersection of all $A_{x,y}$ in this subcover. The sets $A_x$ constitute an open cover
of $X$ such that

$$(x_1 \in A_x, \; x_2 \in A_x) \Longrightarrow |f(x_1, y) - f(x_2, y)| < \frac{\epsilon}{2} \tag{32}$$

for all $x \in X$ and $y \in Y$. Since $X$ is paracompact, there exists (^Hl, Theorem
5.1.9) a locally finite partition $\{\phi_i \mid i \in I\}$ of unity subordinated to the open
cover of $X$ formed by all $A_x$, $x \in X$. For each $i \in I$ choose $x_i \in X$ such that
$\phi_i(x_i) > 0$ (without loss of generality we can assume that such $x_i$ exists for each
$i \in I$) and choose $y_i \in \arg\min_y f(x_i, y)$. Now we can set

$$g(x) := \sum_{i \in I} \phi_i(x) y_i.$$

Inequality (31) follows, by (32) and the convexity of $f(x, y)$ in $y$, from

$$\begin{aligned}
\forall y \in Y : f(x, g(x)) &= f\Bigl(x, \sum_i \phi_i(x) y_i\Bigr) \le \sum_i \phi_i(x) f(x, y_i) \\
&\le \sum_i \phi_i(x) f(x_i, y_i) + \frac{\epsilon}{2} \le \sum_i \phi_i(x) f(x_i, y) + \frac{\epsilon}{2} \\
&\le \sum_i \phi_i(x) f(x, y) + \epsilon = f(x, y) + \epsilon,
\end{aligned}$$

where $i$ ranges over the finite number of $i \in I$ for which $\phi_i(x)$ is non-zero. ∎
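
The construction in this proof can be tried out numerically. The sketch below is only an illustration under assumptions that are not in the text: $X = Y = [0, 1]$, the loss-like function $f(x, y) = (y - x^2)^2$ (convex in $y$), piecewise-linear "hat" functions on a uniform grid playing the role of the partition of unity $\phi_i$, and a grid search standing in for $\arg\min_y$. The names `approximate_choice`, `nodes` and `y_grid` are hypothetical.

```python
def approximate_choice(f, nodes, y_grid):
    """Build g(x) = sum_i phi_i(x) * y_i as in the proof of Lemma 13.

    nodes  : uniformly spaced points x_i of X = [0, 1] (assumption of this sketch)
    y_grid : finite grid of Y used to approximate argmin_y f(x_i, y)
    phi_i  : hat function centred at x_i; the hats are non-negative and sum to 1
    """
    h = nodes[1] - nodes[0]
    # y_i is an (approximate) minimizer of f(x_i, .) over the grid of Y
    y_star = [min(y_grid, key=lambda y, xi=xi: f(xi, y)) for xi in nodes]

    def g(x):
        # convex combination of the y_i with weights phi_i(x)
        return sum(max(0.0, 1.0 - abs(x - xi) / h) * yi
                   for xi, yi in zip(nodes, y_star))

    return g


def f(x, y):
    # illustrative function, continuous and convex in y
    return (y - x * x) ** 2


nodes = [i / 20 for i in range(21)]
y_grid = [j / 200 for j in range(201)]
g = approximate_choice(f, nodes, y_grid)

eps = 0.01
xs = [k / 500 for k in range(501)]
# largest observed gap between f(x, g(x)) and the grid minimum of f(x, .)
worst_gap = max(f(x, g(x)) - min(f(x, y) for y in y_grid) for x in xs)
print(worst_gap <= eps)
```

The function `g` is continuous because it is a finite sum of continuous hat functions, and it takes values in $Y$ because it is a convex combination of points of $Y$; a finer grid of nodes corresponds to a smaller $\epsilon$ in (31).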

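Suppose that $X$ and $Y$ are compact metric spaces, $\Gamma$ is a compact convex
subset of a topological vector space, and $\lambda(x, \gamma, y)$ is continuous in $(x, \gamma, y)$ and
convex in $\gamma \in \Gamma$ (therefore, by Lemma El, $\lambda(x, \gamma, P)$ is continuous in
$(x, \gamma, P) \in X \times \Gamma \times \mathcal{P}(Y)$, and it is convex in $\gamma$). Taking $X \times \mathcal{P}(Y)$ as $X$
and $\Gamma$ as $Y$, we can see that for each $\epsilon > 0$ there exists an approximate choice
function $G$ satisfying

$$\lambda(x, G(x, P), P) \le \inf_{\gamma \in \Gamma} \lambda(x, \gamma, P) + \epsilon, \quad \forall x \in X, \; P \in \mathcal{P}(Y). \tag{33}$$
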
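
For the absolute loss game discussed earlier, an approximate choice function in the sense of (33) can be written down explicitly. The following sketch is an illustration, not the construction used in the paper: it assumes there are no data, identifies $P$ with $p_1 = P(\{1\})$, and uses a linear ramp of width `eps` around $p_1 = 1/2$; the names `ramp_choice` and `excess_loss` are hypothetical.

```python
def expected_loss(gamma, p1):
    """Expected absolute loss |y - gamma| when y = 1 with probability p1."""
    return p1 * (1.0 - gamma) + (1.0 - p1) * gamma


def ramp_choice(p1, eps):
    """Continuous candidate for G: 0 well below p1 = 1/2, 1 well above it,
    and a linear ramp of width eps in between (assumption of this sketch)."""
    return min(1.0, max(0.0, 0.5 + (p1 - 0.5) / eps))


def excess_loss(p1, eps):
    """How much the ramp loses compared with the best prediction, whose
    expected loss in this game is min(p1, 1 - p1)."""
    return expected_loss(ramp_choice(p1, eps), p1) - min(p1, 1.0 - p1)


eps = 0.1
grid = [i / 1000 for i in range(1001)]
largest_excess = max(excess_loss(p, eps) for p in grid)
print(largest_excess <= eps)
```

Outside the ramp the prediction coincides with the precise choice (30); inside it the expected loss is nearly flat in $\gamma$, which is why a continuous interpolation costs at most `eps`.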
Proof of a weak form of Theorem [T]

Suppose $X$ and $Y$ are compact metric spaces and $\Gamma$ is a compact convex subset
of a topological vector space. In this subsection we will prove the existence of a
prediction algorithm guaranteeing (|2Jl (whose antecedent can now be ignored)
with $\le 0$ replaced by $\le \epsilon$ for all continuous prediction rules $D$ for an arbitrarily
small constant $\epsilon > 0$. Let $G$ satisfy (33). If Predictor chooses his predictions by
applying the approximate choice function $G$ to $x_n$ and probability forecasts $P_n$
for $y_n$ satisfying jnj of Theorem |3 we will have

$$\begin{aligned}
\sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) &= \sum_{n=1}^N \lambda(x_n, G(x_n, P_n), y_n) \\
&= \sum_{n=1}^N \lambda(x_n, G(x_n, P_n), P_n) + \sum_{n=1}^N \bigl( \lambda(x_n, G(x_n, P_n), y_n) - \lambda(x_n, G(x_n, P_n), P_n) \bigr) \\
&= \sum_{n=1}^N \lambda(x_n, G(x_n, P_n), P_n) + o(N) \le \sum_{n=1}^N \lambda(x_n, D(x_n), P_n) + \epsilon N + o(N) \\
&= \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) - \sum_{n=1}^N \bigl( \lambda(x_n, D(x_n), y_n) - \lambda(x_n, D(x_n), P_n) \bigr) + \epsilon N + o(N) \\
&= \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) + \epsilon N + o(N).
\end{aligned} \tag{34}$$


7 Defensive forecasting for decision making: 
loss bounds 

The goal of this section is to finish the proof of Theorem ^ and to establish its 
non-asymptotic version. We will start with the latter. 

Results 

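Let $\mathcal{F}$ be an RKHS on $X \times Y$ with finite imbedding constant. For each prediction
rule $D : X \to \Gamma$, define a function $\lambda_D : X \times Y \to \mathbb{R}$ by

$$\lambda_D(x, y) := \lambda(x, D(x), y).$$

The notation $\|f\|_{\mathcal{F}}$ will be used for all functions $f : X \times Y \to \mathbb{R}$: we just
set $\|f\|_{\mathcal{F}} := \infty$ for $f \notin \mathcal{F}$. We will continue to use the notation $c_{\mathcal{F}}$ for the
imbedding constant (defined by (12), with $X \times Y$ as the underlying set). Set

$$c_\lambda := \sup_{x \in X, \gamma \in \Gamma, y \in Y} \lambda(x, \gamma, y) - \inf_{x \in X, \gamma \in \Gamma, y \in Y} \lambda(x, \gamma, y);$$

this is finite if $\lambda$ is continuous and $X$, $\Gamma$, $Y$ are compact.
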
Theorem 6 Suppose $X$ and $Y$ are compact metric spaces, $\Gamma$ is a convex compact
subset of a topological vector space and the loss function $\lambda(x, \gamma, y)$ is
continuous in $(x, \gamma, y)$ and convex in $\gamma \in \Gamma$. Let $\mathcal{F}$ be a forecast-continuous RKHS
on $X \times Y$ with finite imbedding constant $c_{\mathcal{F}}$. There is an on-line prediction
algorithm that guarantees

$$\sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) \le \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \bigl( \|\lambda_D\|_{\mathcal{F}} + 1 \bigr) \sqrt{N} + 1 \tag{35}$$

for all prediction rules $D$ and all $N = 1, 2, \ldots$.
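
To get a feeling for the size of the regret term in (35), one can evaluate it for a few horizons. The sketch below reads the term as $\sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2}\,(\|\lambda_D\|_{\mathcal{F}} + 1)\sqrt{N} + 1$; the numerical values of the constants are made up for illustration, and the function name `regret_bound` is hypothetical.

```python
import math


def regret_bound(c_lambda, c_F, norm_lambda_D, N):
    """Right-hand side of (35) minus the benchmark loss of the rule D."""
    return (math.sqrt(c_lambda ** 2 + 4 * c_F ** 2)
            * (norm_lambda_D + 1) * math.sqrt(N) + 1)


# illustrative constants (assumptions, not values from the paper)
c_lambda, c_F, norm_lambda_D = 1.0, 1.0, 5.0

for N in (100, 10_000, 1_000_000):
    total = regret_bound(c_lambda, c_F, norm_lambda_D, N)
    # the cumulative bound grows like sqrt(N), so the per-step regret vanishes
    print(N, round(total, 1), round(total / N, 4))
```

The per-step regret `total / N` decreases like $N^{-1/2}$, which is the non-asymptotic counterpart of the $o(N)$ terms in (34).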

An application of Hoeffding's inequality immediately gives the following
corollary (we postpone the details of the simple proof to the subsection
"Proof of Corollary 3" below).

Corollary 3 Suppose $X$, $\Gamma$, $Y$ are compact metric spaces and the loss function
$\lambda$ is continuous. Let $N \in \{1, 2, \ldots\}$ and $\delta \in (0, 1)$. There is a randomized
on-line prediction algorithm achieving

$$\sum_{n=1}^N \lambda(x_n, g_n, y_n) \le \sum_{n=1}^N \lambda(x_n, d_n, y_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \bigl( \|\lambda_D\|_{\mathcal{F}} + 1 \bigr) \sqrt{N} + c_\lambda \sqrt{2 \ln \frac{1}{\delta}} \sqrt{N} + 1$$

with probability at least $1 - \delta$ for any randomized prediction rule
$D : X \to \mathcal{P}(\Gamma)$; $g_n$ and $d_n$ are independent random variables distributed as
$\gamma_n$ and $D(x_n)$, respectively.

The above results are non-vacuous only when $\lambda_D$ is an element of the function
space $\mathcal{F}$. If $\mathcal{F}$ is a Sobolev space, this condition follows from $D$ being in the
Sobolev space and the smoothness of $\lambda$. For example, Moser proved in 1966 the
following result concerning composition in Sobolev spaces. Let $\Omega$ be a smooth
bounded domain in $\mathbb{R}^K$ and $m$ be an integer number satisfying $2m > K$. If
$u \in H^m(\Omega)$ and $\Phi \in C^m(\mathbb{R})$, then $\Phi \circ u \in H^m(\Omega)$ (see [SH; for further results,
see IHl).

Two special cases of calibration-cum-resolution 

In the chain (34) we applied the law of large numbers (the property of good
calibration-cum-resolution) twice: in the third and fifth equalities. It is easy
to see, however, that in fact the fifth equality depends only on resolution and
the third equality, although it depends on calibration-cum-resolution, involves
a known function $f$ (in the notation of ©). We will say that the fifth equality
depends on "general resolution" whereas the third equality depends on "specific
calibration-cum-resolution". This limited character of the required
calibration-cum-resolution becomes important for obtaining good bounds on the
predictive performance: in the following subsections we will construct prediction
algorithms that satisfy the properties of specific calibration-cum-resolution and
general resolution and merge them into one algorithm; we will start from the
last step.



Synthesis of prediction algorithms 

The following corollary of Lemma 0] will allow us to construct prediction 
algorithms that achieve two goals simultaneously (specific calibration-cum- 
resolution and general resolution). 

Corollary 4 Let $Y$ be a metric compact and $\Phi_{n,j} : X \times \mathcal{P}(Y) \times Y \to \mathcal{H}_j$,
$n = 1, 2, \ldots$, $j = 0, 1$, be functions taking values in Hilbert spaces $\mathcal{H}_j$ and such
that $\Phi_{n,j}(x, P, y)$ is continuous in $(P, y)$ for all $n$ and both $j$. Let $a_0$ and $a_1$ be
two positive constants. There is a probability forecasting strategy that guarantees

$$\Bigl\| \sum_{n=1}^N \Psi_{n,j}(x_n, P_n, y_n) \Bigr\|_{\mathcal{H}_j}^2 \le \frac{1}{a_j} \sum_{n=1}^N \Bigl( a_0 \|\Psi_{n,0}(x_n, P_n, y_n)\|_{\mathcal{H}_0}^2 + a_1 \|\Psi_{n,1}(x_n, P_n, y_n)\|_{\mathcal{H}_1}^2 \Bigr)$$

for all $N$ and for both $j = 0$ and $j = 1$, where

$$\Psi_{n,j}(x, P, y) := \Phi_{n,j}(x, P, y) - \int_Y \Phi_{n,j}(x, P, y) \, P(dy).$$

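Proof Define the "weighted direct sum" $\mathcal{H}$ of $\mathcal{H}_0$ and $\mathcal{H}_1$ as the Cartesian
product $\mathcal{H}_0 \times \mathcal{H}_1$ equipped with the inner product

$$\langle g, g' \rangle_{\mathcal{H}} = \langle (g_0, g_1), (g'_0, g'_1) \rangle_{\mathcal{H}} := \sum_{j=0}^{1} a_j \langle g_j, g'_j \rangle_{\mathcal{H}_j}.$$

Now we can define $\Phi_n : X \times \mathcal{P}(Y) \times Y \to \mathcal{H}$ by

$$\Phi_n(x, P, y) := \bigl( \Phi_{n,0}(x, P, y), \Phi_{n,1}(x, P, y) \bigr).$$

It is clear that $\Phi_n(x, P, y)$ is continuous in $(P, y)$ for all $n$. Applying the strategy
of Lemma 0] to it and using (15), we obtain

$$\begin{aligned}
\Bigl\| \sum_{n=1}^N \Psi_{n,j}(x_n, P_n, y_n) \Bigr\|_{\mathcal{H}_j}^2
&\le \frac{1}{a_j} \Bigl\| \Bigl( \sum_{n=1}^N \Psi_{n,0}(x_n, P_n, y_n), \sum_{n=1}^N \Psi_{n,1}(x_n, P_n, y_n) \Bigr) \Bigr\|_{\mathcal{H}}^2 \\
&= \frac{1}{a_j} \Bigl\| \sum_{n=1}^N \Psi_n(x_n, P_n, y_n) \Bigr\|_{\mathcal{H}}^2
\le \frac{1}{a_j} \sum_{n=1}^N \|\Psi_n(x_n, P_n, y_n)\|_{\mathcal{H}}^2 \\
&= \frac{1}{a_j} \sum_{n=1}^N \sum_{i=0}^{1} a_i \|\Psi_{n,i}(x_n, P_n, y_n)\|_{\mathcal{H}_i}^2,
\end{aligned}$$

where $\Psi_n(x, P, y) := \Phi_n(x, P, y) - \int_Y \Phi_n(x, P, y) \, P(dy)$. ∎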
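
The first inequality in this chain only uses the definition of the weighted norm: each component satisfies $\|g_j\|^2_{\mathcal{H}_j} \le \frac{1}{a_j} \|g\|^2_{\mathcal{H}}$. A small numerical check, assuming for illustration that $\mathcal{H}_0 = \mathbb{R}^2$ and $\mathcal{H}_1 = \mathbb{R}^3$ with the Euclidean inner products (the helper names are hypothetical):

```python
def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(c * c for c in v)


def weighted_sq_norm(g0, g1, a0, a1):
    """Squared norm of (g0, g1) in the weighted direct sum of the two spaces."""
    return a0 * sq_norm(g0) + a1 * sq_norm(g1)


g0, g1 = [3.0, 4.0], [1.0, 2.0, 2.0]   # squared norms 25 and 9
a0, a1 = 0.5, 2.0

total = weighted_sq_norm(g0, g1, a0, a1)   # 0.5 * 25 + 2.0 * 9 = 30.5
print(sq_norm(g0) <= total / a0, sq_norm(g1) <= total / a1)
```

Choosing the weights $a_0$ and $a_1$ therefore trades off how tightly each of the two goals is controlled by the single strategy applied to the combined space.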



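Suppose $X$, $\Gamma$, $Y$ are metric compacts and $\mathcal{F}$ is a forecast-continuous RKHS
on $X \times Y$. Let $G_n : X \times \mathcal{P}(Y) \to \Gamma$ be a sequence of approximate choice
functions satisfying

$$\lambda(x, G_n(x, P), P) \le \inf_{\gamma \in \Gamma} \lambda(x, \gamma, P) + 2^{-n}, \quad \forall x \in X, \; P \in \mathcal{P}(Y)$$

(they exist by (33)). Corollary 4 will be applied to $a_0 = a_1 = 1$ and to the
mappings

$$\Psi_{n,0}(x, P, y) := \lambda(x, G_n(x, P), y) - \lambda(x, G_n(x, P), P), \tag{36}$$

$$\Psi_{n,1}(x, P, y) := k_{x,y} - k_{x,P}, \tag{37}$$

where $k_{x,y}$ is the evaluation functional at $(x, y)$ for $\mathcal{F}$ and $k_{x,P}$ is the mean of
$k_{x,y}$ with respect to $P(dy)$. It is easy to see that

$$\|\Psi_{n,0}(x, P, y)\|_{\mathbb{R}} = |\Psi_{n,0}(x, P, y)| \le c_\lambda, \quad \|\Psi_{n,1}(x, P, y)\|_{\mathcal{F}} \le 2c_{\mathcal{F}}. \tag{38}$$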

Specific calibration-cum-resolution 

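Corollary 4 immediately implies:

Lemma 14 The probability forecasting strategy of Corollary 4 based on (36)
and (37) guarantees

$$\Bigl| \sum_{n=1}^N \bigl( \lambda(x_n, G_n(x_n, P_n), y_n) - \lambda(x_n, G_n(x_n, P_n), P_n) \bigr) \Bigr| \le \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N}.$$

Proof This follows from

$$\Bigl| \sum_{n=1}^N \bigl( \lambda(x_n, G_n(x_n, P_n), y_n) - \lambda(x_n, G_n(x_n, P_n), P_n) \bigr) \Bigr| \le \sqrt{\sum_{n=1}^N \bigl( c_\lambda^2 + 4c_{\mathcal{F}}^2 \bigr)} = \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N}$$

(see (38)). ∎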



General resolution I 

The following lemma is proven similarly to Lemma 

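Lemma 15 The probability forecasting strategy of Corollary 4 based on (36)
and (37) guarantees

$$\Bigl| \sum_{n=1}^N \bigl( \lambda(x_n, D(x_n), y_n) - \lambda(x_n, D(x_n), P_n) \bigr) \Bigr| \le \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \|\lambda_D\|_{\mathcal{F}} \sqrt{N}.$$

Proof This follows from

$$\begin{aligned}
\Bigl| \sum_{n=1}^N \bigl( \lambda(x_n, D(x_n), y_n) - \lambda(x_n, D(x_n), P_n) \bigr) \Bigr|
&= \Bigl| \sum_{n=1}^N \bigl( \lambda_D(x_n, y_n) - \lambda_D(x_n, P_n) \bigr) \Bigr| \\
&= \Bigl| \Bigl\langle \lambda_D, \sum_{n=1}^N \bigl( k_{x_n, y_n} - k_{x_n, P_n} \bigr) \Bigr\rangle_{\mathcal{F}} \Bigr| \\
&\le \|\lambda_D\|_{\mathcal{F}} \Bigl\| \sum_{n=1}^N \bigl( k_{x_n, y_n} - k_{x_n, P_n} \bigr) \Bigr\|_{\mathcal{F}} \\
&\le \|\lambda_D\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N \bigl( c_\lambda^2 + 4c_{\mathcal{F}}^2 \bigr)} = \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \|\lambda_D\|_{\mathcal{F}} \sqrt{N}
\end{aligned}$$

(we have used Corollary 4 and (38)). ∎
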
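Proof of Theorem 6

Let $\gamma_n := G_n(x_n, P_n)$ where $P_n$ are produced by the probability forecasting
strategy of Corollary 4 based on (36) and (37). Following (34) and using the
previous two lemmas, we obtain:

$$\begin{aligned}
\sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) &= \sum_{n=1}^N \lambda(x_n, G_n(x_n, P_n), y_n) \\
&= \sum_{n=1}^N \lambda(x_n, G_n(x_n, P_n), P_n) + \sum_{n=1}^N \bigl( \lambda(x_n, G_n(x_n, P_n), y_n) - \lambda(x_n, G_n(x_n, P_n), P_n) \bigr) \\
&\le \sum_{n=1}^N \lambda(x_n, G_n(x_n, P_n), P_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N} \\
&\le \sum_{n=1}^N \lambda(x_n, D(x_n), P_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N} + 1 \\
&= \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N} + 1 - \sum_{n=1}^N \bigl( \lambda(x_n, D(x_n), y_n) - \lambda(x_n, D(x_n), P_n) \bigr) \\
&\le \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \bigl( \|\lambda_D\|_{\mathcal{F}} + 1 \bigr) \sqrt{N} + 1.
\end{aligned}$$

(The term $+1$ bounds $\sum_{n=1}^N 2^{-n}$, the total slack of the approximate choice
functions $G_n$.) ∎
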
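Proof of Corollary 3

Since $\lambda(x_n, g_n, y_n) - \lambda(x_n, d_n, y_n)$ never exceeds $c_\lambda$ in absolute value, Hoeffding's
inequality ([3], Corollary A.1) shows that

$$\mathbb{P} \Bigl\{ \sum_{n=1}^N \bigl( \lambda(x_n, g_n, y_n) - \lambda(x_n, d_n, y_n) \bigr) - \sum_{n=1}^N \bigl( \lambda(x_n, \gamma_n, y_n) - \lambda(x_n, D(x_n), y_n) \bigr) > t \Bigr\} \le \exp\Bigl( -\frac{t^2}{2c_\lambda^2 N} \Bigr)$$

for every $t > 0$. Choosing $t$ satisfying

$$\exp\Bigl( -\frac{t^2}{2c_\lambda^2 N} \Bigr) = \delta,$$

i.e.,

$$t = c_\lambda \sqrt{2 \ln \frac{1}{\delta}} \sqrt{N},$$

we obtain the statement of Corollary 3.
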
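
The choice of $t$ in this proof can be checked numerically. The sketch below solves $\exp(-t^2 / (2c_\lambda^2 N)) = \delta$ for $t$ and plugs the result back into the tail bound; the function names and the sample values of $c_\lambda$, $\delta$ and $N$ are illustrative assumptions.

```python
import math


def hoeffding_deviation(c_lambda, delta, N):
    """The t solving exp(-t^2 / (2 c_lambda^2 N)) = delta."""
    return c_lambda * math.sqrt(2.0 * math.log(1.0 / delta)) * math.sqrt(N)


def tail_bound(t, c_lambda, N):
    """Hoeffding's upper bound on the probability of a deviation above t."""
    return math.exp(-t ** 2 / (2.0 * c_lambda ** 2 * N))


c_lambda, delta, N = 1.0, 0.05, 10_000
t = hoeffding_deviation(c_lambda, delta, N)
print(round(t, 2), tail_bound(t, c_lambda, N))
```

The extra term grows like $\sqrt{N}$, the same order as the deterministic regret term, and depends on the confidence level only through $\sqrt{\ln(1/\delta)}$.
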
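General resolution II

To prove Theorem ^ we will need the following variation on Lemma 15.

Lemma 16 The probability forecasting strategy of Corollary 4 based on (36)
and (37) guarantees

$$\Bigl| \sum_{n=1}^N \Bigl( f(x_n, y_n) - \int_Y f(x_n, y) \, P_n(dy) \Bigr) \Bigr| \le \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \|f\|_{\mathcal{F}} \sqrt{N}$$

for any $f \in \mathcal{F}$.

Proof Following the proof of Lemma 15,

$$\begin{aligned}
\Bigl| \sum_{n=1}^N \Bigl( f(x_n, y_n) - \int_Y f(x_n, y) \, P_n(dy) \Bigr) \Bigr|
&\le \|f\|_{\mathcal{F}} \Bigl\| \sum_{n=1}^N \bigl( k_{x_n, y_n} - k_{x_n, P_n} \bigr) \Bigr\|_{\mathcal{F}} \\
&\le \|f\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N \bigl( c_\lambda^2 + 4c_{\mathcal{F}}^2 \bigr)} = \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \|f\|_{\mathcal{F}} \sqrt{N}. \; ∎
\end{aligned}$$
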
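Proof of Theorem [T]

As in the proof of Theorem ^ we first assume that $X$, $\Gamma$, and $Y$ are compact. Let
us first see that the prediction algorithm of Theorem 6 fed with a suitable RKHS
guarantees the consequent of l(2Jl for all continuous $D$. Let $\mathcal{F}$ be a universal and
continuous RKHS on $X \times Y$ with finite imbedding constant $c_{\mathcal{F}}$.

Fix a continuous decision rule $D : X \to \Gamma$. For any $\epsilon > 0$, we can find a
function $f \in \mathcal{F}$ that is $\epsilon$-close in $C(X \times Y)$ to $\lambda(x, D(x), y)$. Following (34) and
the similar chain in the proof of Theorem 6, we obtain:

$$\begin{aligned}
\sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) &= \sum_{n=1}^N \lambda(x_n, G_n(x_n, P_n), y_n) \\
&= \sum_{n=1}^N \lambda(x_n, G_n(x_n, P_n), P_n) + \sum_{n=1}^N \bigl( \lambda(x_n, G_n(x_n, P_n), y_n) - \lambda(x_n, G_n(x_n, P_n), P_n) \bigr) \\
&\le \sum_{n=1}^N \lambda(x_n, G_n(x_n, P_n), P_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N} \\
&\le \sum_{n=1}^N \lambda(x_n, D(x_n), P_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N} + 1 \\
&= \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N} + 1 - \sum_{n=1}^N \bigl( \lambda(x_n, D(x_n), y_n) - \lambda(x_n, D(x_n), P_n) \bigr) \\
&\le \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \sqrt{N} + 1 - \sum_{n=1}^N \Bigl( f(x_n, y_n) - \int_Y f(x_n, y) \, P_n(dy) \Bigr) + 2\epsilon N \\
&\le \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) + \sqrt{c_\lambda^2 + 4c_{\mathcal{F}}^2} \, \bigl( \|f\|_{\mathcal{F}} + 1 \bigr) \sqrt{N} + 1 + 2\epsilon N.
\end{aligned}$$

We can see that

$$\limsup_{N \to \infty} \Bigl( \frac{1}{N} \sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) \Bigr) \le 2\epsilon;$$

since this is true for any $\epsilon > 0$, the consequent of (O holds.
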
It remains to get rid of the assumption of compactness of $X$, $\Gamma$, and $Y$. We
will need the following lemma.

Lemma 17 Under the conditions of Theorem ^ for each pair of compact sets
$A \subseteq X$ and $B \subseteq Y$ there exists a compact set $C = C(A, B) \subseteq \Gamma$ such that for
each continuous prediction rule $D : X \to \Gamma$ there exists a continuous prediction
rule $D' : X \to C$ that dominates $D$ in the sense

$$\forall x \in A, \; y \in B : \lambda(x, D'(x), y) \le \lambda(x, D(x), y). \tag{39}$$

Proof Without loss of generality $A$ and $B$ are assumed non-empty. Fix any
$\gamma_0 \in \Gamma$. Let

$$M_1 := \sup_{(x, y) \in A \times B} \lambda(x, \gamma_0, y),$$

let $C_1 \subseteq \Gamma$ be a compact set such that

$$\forall x \in A, \; \gamma \notin C_1, \; y \in B : \lambda(x, \gamma, y) > M_1 + 1,$$

let

$$M_2 := \sup_{(x, \gamma, y) \in A \times C_1 \times B} \lambda(x, \gamma, y),$$

and let $C_2 \subseteq \Gamma$ be a compact set such that

$$\forall x \in A, \; \gamma \notin C_2, \; y \in B : \lambda(x, \gamma, y) > M_2 + 1.$$

It is obvious that $M_1 \le M_2$ and $\gamma_0 \in C_1 \subseteq C_2$.

Let us now check that $C_1$ lies inside the interior of $C_2$. Indeed, for any fixed
$(x, y) \in A \times B$ and $\gamma \in C_1$, we have $\lambda(x, \gamma, y) \le M_2$; since $\lambda(x, \gamma', y) > M_2 + 1$
for all $\gamma' \notin C_2$, some neighborhood of $\gamma$ will lie completely in $C_2$.

Let $D : X \to \Gamma$ be a continuous prediction rule. We will show that (39)
holds for some continuous prediction rule $D'$ taking values in the compact set
$C_2$. Namely, we define

$$D'(x) := \begin{cases}
D(x) & \text{if } D(x) \in C_1 \\[4pt]
\dfrac{\rho(D(x), \Gamma \setminus C_2)}{\rho(D(x), C_1) + \rho(D(x), \Gamma \setminus C_2)} D(x) + \dfrac{\rho(D(x), C_1)}{\rho(D(x), C_1) + \rho(D(x), \Gamma \setminus C_2)} \gamma_0 & \text{if } D(x) \in C_2 \setminus C_1 \\[4pt]
\gamma_0 & \text{if } D(x) \in \Gamma \setminus C_2,
\end{cases}$$

where $\rho$ is the metric on $\Gamma$; the denominator $\rho(D(x), C_1) + \rho(D(x), \Gamma \setminus C_2)$
is always positive since already $\rho(D(x), C_1)$ is positive. Assuming $C_2$ convex
(which can be done by [21], Theorem 3.20(c)), we can see that $D'$ indeed takes
values in $C_2$. The only points $x$ at which the continuity of $D'$ is not obvious are
those for which $D(x)$ lies on the boundary of $C_1$: one has to use the fact that
$C_1$ is covered by the interior of $C_2$.

It remains to check (39): the only non-trivial case is $D(x) \in C_2 \setminus C_1$. By the
convexity of $\lambda(x, \gamma, y)$ in $\gamma$, the inequality in (39) will follow from

$$\frac{\rho(D(x), \Gamma \setminus C_2)}{\rho(D(x), C_1) + \rho(D(x), \Gamma \setminus C_2)} \lambda(x, D(x), y) + \frac{\rho(D(x), C_1)}{\rho(D(x), C_1) + \rho(D(x), \Gamma \setminus C_2)} \lambda(x, \gamma_0, y) \le \lambda(x, D(x), y),$$

i.e.,

$$\lambda(x, \gamma_0, y) \le \lambda(x, D(x), y).$$

Since the left-hand side of the last inequality is at most $M_1$ and its right-hand
side exceeds $M_1 + 1$, it holds true. ∎
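
The truncation used in this proof is easy to visualize in one dimension. The sketch below is an illustration under assumptions of my own choosing: $\Gamma = \mathbb{R}$, squared loss $\lambda(\gamma, y) = (\gamma - y)^2$ with no data, $B = [-0.5, 0.5]$, $\gamma_0 = 0$, $C_1 = [-2, 2]$ and $C_2 = [-4, 4]$ (these intervals satisfy the defining inequalities, since $M_1 = 0.25$ and $M_2 = 6.25$). The function `dominate` is a hypothetical name for the map $D(x) \mapsto D'(x)$.

```python
def dist_to_interval(g, c):
    """Distance from g to the interval [-c, c]."""
    return max(0.0, abs(g) - c)


def dist_to_complement(g, c):
    """Distance from g to the complement of [-c, c]."""
    return max(0.0, c - abs(g))


def dominate(g, c1=2.0, c2=4.0, gamma0=0.0):
    """Map a prediction g = D(x) to D'(x) as in the proof of Lemma 17."""
    if abs(g) <= c1:                 # D(x) in C1: keep the prediction
        return g
    if abs(g) >= c2:                 # D(x) outside the interior of C2
        return gamma0
    d1 = dist_to_interval(g, c1)     # rho(D(x), C1), positive here
    d2 = dist_to_complement(g, c2)   # rho(D(x), complement of C2)
    w = d2 / (d1 + d2)
    return w * g + (1.0 - w) * gamma0


def loss(g, y):
    return (g - y) ** 2


predictions = [i / 10 for i in range(-60, 61)]
outcomes = [j / 10 for j in range(-5, 6)]
never_worse = all(loss(dominate(g), y) <= loss(g, y) + 1e-12
                  for g in predictions for y in outcomes)
print(never_worse)
```

The interpolation weight `w` moves continuously from 1 on the boundary of $C_1$ to 0 on the boundary of $C_2$, which is what makes $D'$ continuous whenever $D$ is.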

For each pair of compact $A \subseteq X$ and $B \subseteq Y$ fix a compact $C(A, B) \subseteq \Gamma$
as in the lemma. Similarly to the proof of Theorem |21 Predictor's strategy
ensuring ^ is constructed from Remover's winning strategy in $G(X \times Y)$ and
from Predictor's strategies $S(A, B)$ outputting predictions $\gamma_n \in C(A, B)$ and
ensuring the consequent of ^ for $D : X \to C(A, B)$ under the assumption that
$(x_n, y_n) \in A \times B$ for given compact $A \subseteq X$ and $B \subseteq Y$. Remover's moves are
assumed to be of the form $A \times B$ for compact $A \subseteq X$ and $B \subseteq Y$. Predictor is
simultaneously playing the game of removal $G(X \times Y)$ as Evader.

Predictor asks Remover to make his first move $A_1 \times B_1$ in the game of
removal. Predictor then plays the prediction game using the strategy $S(A_1, B_1)$
until Reality chooses $(x_n, y_n) \notin A_1 \times B_1$ (forever if Reality never chooses such
$(x_n, y_n)$). As soon as such $(x_n, y_n)$ is chosen, Predictor announces $(x_n, y_n)$ in
the game of removal and notes Remover's response $(A_2, B_2)$. He then continues
playing the prediction game using the strategy $S(A_2, B_2)$ until Reality chooses
$(x_n, y_n) \notin A_2 \times B_2$, etc.

Let us check that this strategy for Predictor will always ensure (O. If Reality
chooses $(x_n, y_n)$ outside Predictor's current $A_k \times B_k$ finitely often, the
consequent of Q will be satisfied for all continuous $D : X \to C(A_k, B_k)$ ($(A_k, B_k)$
being Remover's last move) and so, by Lemma 17, for all continuous $D : X \to \Gamma$.
If Reality chooses $(x_n, y_n)$ outside Predictor's current $A_k \times B_k$ infinitely often,
the set of $(x_n, y_n)$, $n = 1, 2, \ldots$, will not be precompact, and so the antecedent
of 10 will be violated.



Proof of Theorem [21 

Define

$$\lambda(x, \gamma, y) := \int_\Gamma \lambda(x, g, y) \, \gamma(dg), \tag{40}$$

where $\gamma$ is a probability measure on $\Gamma$. This is the loss function in a new
game of prediction with the prediction space $\mathcal{P}(\Gamma)$. When $\gamma$ ranges over $\mathcal{P}(C)$
(identified with the subset of $\mathcal{P}(\Gamma)$ consisting of the measures concentrated on
$C$) for a compact $C$, the loss function (40) is continuous by Lemma lHl. We need
the following analogue of Lemma 17.

Lemma 18 Under the conditions of Theorem \^ for each pair of compact sets
$A \subseteq X$ and $B \subseteq Y$ there exists a compact set $C = C(A, B) \subseteq \Gamma$ such that
for each continuous randomized prediction rule $D : X \to \mathcal{P}(\Gamma)$ there exists a
continuous randomized prediction rule $D' : X \to \mathcal{P}(C)$ such that (39) holds ($D'$
dominates $D$ "on average").

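Proof Define $\gamma_0$, $C_1$, and $C_2$ as in the proof of Lemma 17. Fix a continuous
function $f_1 : \Gamma \to [0, 1]$ such that $f_1 = 1$ on $C_1$ and $f_1 = 0$ on $\Gamma \setminus C_2$ (such
an $f_1$ exists by the Tietze-Uryson theorem, ^^1, 2.1.8). Set $f_2 := 1 - f_1$.
Let $D : X \to \mathcal{P}(\Gamma)$ be a continuous randomized prediction rule. For each
$x \in X$, split $D(x)$ into two measures on $\Gamma$ absolutely continuous with respect to
$D(x)$: $D_1(x)$ with Radon-Nikodym density $f_1$ and $D_2(x)$ with Radon-Nikodym
density $f_2$; set

$$D'(x) := D_1(x) + |D_2(x)| \, \delta_{\gamma_0}$$

(letting $|P| := P(\Gamma)$ for measures $P$ on $\Gamma$). It is clear that $D'$ is continuous (in the
topology of weak convergence, as usual), takes values in $\mathcal{P}(C_2)$, and

$$\begin{aligned}
\lambda(x, D'(x), y) &= \int_\Gamma \lambda(x, \gamma, y) f_1(\gamma) \, D(x)(d\gamma) + \lambda(x, \gamma_0, y) \int_\Gamma f_2(\gamma) \, D(x)(d\gamma) \\
&\le \int_\Gamma \lambda(x, \gamma, y) f_1(\gamma) \, D(x)(d\gamma) + \int_\Gamma M_1 f_2(\gamma) \, D(x)(d\gamma) \\
&\le \int_\Gamma \lambda(x, \gamma, y) f_1(\gamma) \, D(x)(d\gamma) + \int_\Gamma \lambda(x, \gamma, y) f_2(\gamma) \, D(x)(d\gamma) = \lambda(x, D(x), y)
\end{aligned}$$

for all $(x, y) \in A \times B$. ∎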
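
For a randomized prediction with finitely many atoms the splitting $D(x) = D_1(x) + D_2(x)$ can be carried out directly. The sketch below is illustrative and rests on assumptions not made in the text: $\Gamma = \mathbb{R}$, squared loss, $B = [-0.5, 0.5]$, $\gamma_0 = 0$, $C_1 = [-2, 2]$, $C_2 = [-4, 4]$, and a piecewise-linear cutoff in the role of $f_1$. A measure is represented as a dictionary from atoms to probabilities; all function names are hypothetical.

```python
def cutoff(g, c1=2.0, c2=4.0):
    """A continuous f1: equal to 1 on [-c1, c1] and 0 outside (-c2, c2)."""
    a = abs(g)
    if a <= c1:
        return 1.0
    if a >= c2:
        return 0.0
    return (c2 - a) / (c2 - c1)


def dominate_randomized(measure, gamma0=0.0):
    """Return D' = D1 + |D2| * delta_{gamma0} for a finitely supported D."""
    new_measure = {}
    moved_mass = 0.0
    for atom, prob in measure.items():
        f1 = cutoff(atom)
        if f1 > 0.0:
            new_measure[atom] = new_measure.get(atom, 0.0) + f1 * prob
        moved_mass += (1.0 - f1) * prob      # mass of D2, sent to gamma0
    if moved_mass > 0.0:
        new_measure[gamma0] = new_measure.get(gamma0, 0.0) + moved_mass
    return new_measure


def expected_loss(measure, y):
    """The loss (40) for squared loss and a finitely supported measure."""
    return sum(prob * (atom - y) ** 2 for atom, prob in measure.items())


D = {0.5: 0.2, 3.0: 0.5, 10.0: 0.3}
D_prime = dominate_randomized(D)
outcomes = [j / 10 for j in range(-5, 6)]
print(D_prime)
print(all(expected_loss(D_prime, y) <= expected_loss(D, y) for y in outcomes))
```

Unlike the deterministic case, no convexity of the loss in $\gamma$ is needed here: mass is moved to $\gamma_0$ rather than predictions being averaged.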

Fix one of the mappings $(A, B) \mapsto C(A, B)$ whose existence is asserted by the
lemma.

We will prove that the strategy of the previous subsection with $\mathcal{P}(C(A, B))$
in place of $C(A, B)$ applied to the new game is universally consistent. Let
$D : X \to \mathcal{P}(\Gamma)$ be a continuous randomized prediction rule, i.e., a continuous
prediction rule in the new game. Let $(A_k, B_k)$ be Remover's last move (if
Remover makes infinitely many moves, the antecedent of (O is false, and there is
nothing to prove), and let $D' : X \to \mathcal{P}(C(A_k, B_k))$ be a continuous randomized
prediction rule satisfying (39) with $A := A_k$ and $B := B_k$. From some $n$
on our randomized prediction algorithm produces $\gamma_n \in \mathcal{P}(\Gamma)$ concentrated on
$C(A_k, B_k)$, and they will satisfy

$$\begin{aligned}
&\limsup_{N \to \infty} \Bigl( \frac{1}{N} \sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(x_n, D(x_n), y_n) \Bigr) \\
&\quad \le \limsup_{N \to \infty} \Bigl( \frac{1}{N} \sum_{n=1}^N \lambda(x_n, \gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(x_n, D'(x_n), y_n) \Bigr) \le 0.
\end{aligned} \tag{41}$$

The loss function is bounded in absolute value on the compact set $A_k \times
(C(A_k, B_k) \cup D(A_k)) \times B_k$ by a constant $c$. The law of the iterated logarithm
(see, e.g., [^, (5.8)) implies that



En=i i.Kxn,9n,yn) " A(a:;„ , 7„ , y„) ) 



lim sup 

N^oo 



Hn=l (A(x„,d„,?/„) - X{Xn,D{Xn),yn)) 



lim sup 



< 1 



with probability one. Combining the last two inequalities with (41) gives



This immediately implies Q. 

8 Conclusion 

In this section I will list what I think are interesting directions of further re- 
search. 

The data space as a bottleneck 

It is easy to see that if we set $X := \bigcup_{n=0}^{\infty} Y^n$ and

$$x_n := (y_1, \ldots, y_{n-1}),$$

it becomes impossible to compete even with the simplest prediction rules
$D : X \to Y$: there need be no connection between the restrictions of $D$ to
$Y^n$ for different $n$. The requirement that $y_1, \ldots, y_{n-1}$ should be compressed
into an element $x_n$ of a locally compact space $X$ restricts the set of possible
prediction rules so that it becomes manageable. We can consider $X$ to be the
necessary bottleneck in our notion of a prediction rule, and the requirement of
local compactness of $X$ makes it narrow enough for us to be able to compete
with all continuous prediction rules. A natural question is: can the requirement
of the local compactness of $X$ be weakened while preserving the existence of
on-line prediction algorithms competitive with the continuous prediction rules?
(And it should be remembered that our (0) might be a poor formalization of
the latter property if sizeable pieces of $X$ cannot be expected to be compact.)

Randomization

It appears that various aspects of randomization in this paper and competitive
on-line prediction in general deserve further study. For example, the bound
of Corollary 3 is based on the worst possible outcome of Predictor's
randomization and the best possible outcome of the prediction rule's randomization
(disregarding an event of probability at most $\delta$). This is unfair to Predictor. Of
course, comparing the expected values of Predictor's and the prediction rule's
loss would be an even worse solution: this would ignore the magnitude of the
likely deviations of the loss from its expected value. It would be too crude to use
the variance as the only indicator of the likely deviations, and it appears that
the right formalization should involve the overall distribution of the deviations.

A related observation is that, when using a prediction strategy based on
defensive forecasting, Predictor needs randomization only when there are several
very different predictions with similar expected losses with respect to the current
probability forecast $P_n$. Since $P_n$ are guaranteed to agree with reality, we would
not expect that Predictor will often find himself in such a position provided
Reality is neutral (rather than an active opponent). Predictor's strategy will be
almost deterministic. It would be interesting to formalize this intuition.

Limitations of competitive on-line prediction 

In conclusion, I will briefly discuss two serious limitations of this paper. 

First, the main results of this paper only concern one-step-ahead prediction.
In a more general framework the loss function would depend not only on $y_n$
but on other future outcomes as well. There are simple ways of extending our
results in this direction: e.g., if the loss function $\lambda = \lambda(x_n, \gamma_n, y_n, y_{n+1})$ depends
on both $y_n$ and $y_{n+1}$, we could run two on-line prediction algorithms with the
observation space $Y^2$, one responsible for choosing $\gamma_n$ for odd $n$ and the other
for even $n$. However, cleaner and more principled approaches are needed.

As we noted earlier (see Remark the general interpretation of D- 
predictions is that they are decisions made by a small decision maker. To see 
why the decision maker is assumed small, let us consider (^, which is the kind of
guarantee (such as (35)) provided in competitive on-line prediction (although
see [3], Section 7.11, for a recent advance). Predictor's and the prediction rule
$D$'s losses are compared on the same sequence $x_1, y_1, x_2, y_2, \ldots$ of data and
observations. If Predictor is a big decision maker (i.e., his decisions affect Reality's
future behavior) the interpretation of |^ becomes problematic: presumably,
$x_1, y_1, x_2, y_2, \ldots$ resulted from Predictor's decisions $\gamma_n$, and $D$'s loss should be
evaluated on a different sequence: the sequence $x'_1, y'_1, x'_2, y'_2, \ldots$ resulting from
$D$'s decisions $D(x_n)$.

The approach of this paper is based on defensive forecasting: the ability to
produce ideal, in important aspects, probability forecasts. It is interesting that
ideal probability forecasts are not sufficient in big decision making. As a simple
example, consider the game where there is no $X$, $\Gamma = Y = \{0, 1\}$, and the loss
function $\lambda$ is given by the matrix

$$\begin{array}{c|cc}
 & y = 0 & y = 1 \\ \hline
\gamma = 0 & 1 & 2 \\
\gamma = 1 & 2 & 0
\end{array}$$

Reality's strategy is $y_n := \gamma_n$, but Predictor's initial theory is that Reality
always chooses $y_n = 0$.

Predictor's "optimal" strategy based on his initial beliefs is to always choose
$\gamma_n = 0$, suffering loss 1 at each step. His initial beliefs are reinforced with every
move by Reality. Intuitively it is clear that Predictor's mistake in not choosing
$\gamma_n = 1$ is that he was being greedy (concentrated on exploitation and completely
neglected exploration). However,

• he acted optimally given his beliefs, 

• his beliefs have been verified by what actually happened. 

In big decision making we have to worry about what would have happened if 
we had acted in a different way. 
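
This example is small enough to simulate. The sketch below assumes the loss matrix with rows $\gamma = 0, 1$ and columns $y = 0, 1$ equal to $(1, 2)$ and $(2, 0)$, which is the reading consistent with "suffering loss 1 at each step"; the function names are hypothetical.

```python
# LOSS[(gamma, y)]: assumed entries of the loss matrix of the example
LOSS = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 0}


def reality(gamma):
    """Reality's strategy in the example: the outcome copies the decision."""
    return gamma


def greedy_decision(believed_outcome):
    """Decision minimizing the loss under the belief about the next outcome."""
    return min((0, 1), key=lambda gamma: LOSS[(gamma, believed_outcome)])


def play(decide, rounds):
    """Play the game and return the total loss and the observed outcomes."""
    total, outcomes = 0, []
    for _ in range(rounds):
        gamma = decide()
        y = reality(gamma)
        total += LOSS[(gamma, y)]
        outcomes.append(y)
    return total, outcomes


# Predictor who believes that Reality always chooses y = 0
greedy_total, greedy_outcomes = play(lambda: greedy_decision(0), 10)
# A decision maker who tries the "sub-optimal" decision gamma = 1
explorer_total, _ = play(lambda: 1, 10)
print(greedy_total, set(greedy_outcomes), explorer_total)
```

The greedy Predictor never observes anything contradicting his theory, yet the alternative decision would have cost nothing; the forecasts are perfect and the decisions are still poor.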

My hope is that game-theoretic probability has an important role to play
in big decision making as well. A standard picture in the philosophy of science
(see, e.g., [3311231) is that science progresses via struggle between (probabilistic)
theories, and it is conceivable that something like this also happens in individual
(human and animal) learning. Based on good theories (the ones that survive
serious attempts to overthrow them) we can make good decisions. Testing of
probabilistic theories is crucial in this process, and the game-theoretic version
of the testing process (gambling against the theory) is much more flexible than
the standard approach to testing statistical hypotheses: at each time we know
to what degree the theory has been falsified. It is important, however, that the
skeptic testing the theory should not only do this playing the imaginary game
with the imaginary capital; he should also venture in the real world. Predictor's
theory that Reality always chooses $y_n = 0$ would not survive for more than one
round had it been tested (by choosing a sub-optimal, from the point of view of
the old theory, decision).

Big decision making is a worthy goal but it is very difficult to prove anything 
about it, and elegant mathematical results might be beyond our reach for some 
time. Small decision making is also important but much easier; in many cases 
we can do it almost perfectly. 